
Silos no longer stable after upgrading to v3.6.2 #7973

Closed
DocIT-Official opened this issue Sep 12, 2022 · 24 comments
Labels: area-silos category for all the silos related issues

@DocIT-Official

Background:

Been running Orleans in our production SAAS platform for a few years now, we have 8 silo's however 4 out of 8 are the same docker container just some different labels in AKS deployment to load different grain assemblies. the reason for this is that we have real-time OCR processing and that we want that work to run in their own PODS. this has been working with great success for over 2 years. however we upgrade all our nuget packages to v3.6.2 and now we are getting hundreds of pod restarts because pods stop responding to heartbeat while processing work which is causing work to be aborted, I'm looking for some guidance as this behavior is only observed once deployed to AKS, all our integration tests pass and nothing is showing up in insights to make us believe there any unhandled exceptions

@ghost ghost added the Needs: triage 🔍 label Sep 12, 2022
@ReubenBond
Member

Which version of Orleans were you using prior to 3.6.2?

@DocIT-Official
Author

3.5.0

@ReubenBond
Member

Did you update any packages other than Orleans in the transition?

@DocIT-Official
Author

Just our own internal NuGet packages and anything that caused package downgrades.

@ReubenBond
Member

I wonder what is causing this.

Do you see log messages complaining of long delays?

You might see improved stability by enabling these two options:

siloBuilder.Configure<ClusterMembershipOptions>(options =>
{
  options.ExtendProbeTimeoutDuringDegradation = true;
  options.EnableIndirectProbes = true;
})

@DocIT-Official
Author

All I see in the log, a few seconds after the grain starts processing work, is the following; as we know, this will cause eviction, and in turn Kubernetes will destroy the pod:

Orleans.Networking.Shared.SocketConnectionException
Unable to connect to 10.4.16.20:30101. Error: ConnectionRefused

Orleans.Runtime.OrleansMessageRejectionException
Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S10.4.16.20:30101:400713975. See InnerException
 ---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 10.4.16.20:30101. Error: ConnectionRefused
   at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 52
   at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 53
   at Orleans.Internal.OrleansTaskExtentions.MakeCancellable[T](Task`1 task, CancellationToken cancellationToken)
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 262
   --- End of inner exception stack trace ---
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 286
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 139
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 139
   at Orleans.Runtime.Messaging.OutboundMessageQueue.<SendMessage>g__SendAsync|10_0(ValueTask`1 c, Message m) in /_/src/Orleans.Runtime/Messaging/OutboundMessageQueue.cs:line 123
   at Orleans.Runtime.Messaging.OutboundMessageQueue.<SendMessage>g__SendAsync|10_0(ValueTask`1 c, Message m) in /_/src/Orleans.Runtime/Messaging/OutboundMessageQueue.cs:line 123

@ReubenBond
Member

ReubenBond commented Sep 12, 2022

Which version of .NET are you using? Is it .NET 6? If not, upgrading may help you in the event that you are indeed seeing crippling ThreadPool starvation. Also, what are the settings on your k8s pods: do you have CPU requests/limits set?

By the way, please do try the configuration above. We have since made those options the default in newer releases and will likely make them the default in 3.x at some point.
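
For readers trying to confirm ThreadPool starvation before collecting full traces, here is a minimal sketch (not from this thread; the Console sink and polling interval are arbitrary choices) that logs ThreadPool pressure from inside a silo process on .NET Core 3.0 or later:

using System;
using System.Threading;
using System.Threading.Tasks;

public static class ThreadPoolMonitor
{
    // A steadily growing PendingWorkItemCount while ThreadCount keeps climbing
    // is the classic starvation signature: work is queued faster than threads
    // become available to run it.
    public static void Start(TimeSpan interval, CancellationToken cancellation) =>
        _ = Task.Run(async () =>
        {
            try
            {
                while (!cancellation.IsCancellationRequested)
                {
                    Console.WriteLine(
                        $"ThreadPool threads={ThreadPool.ThreadCount} " +
                        $"pending={ThreadPool.PendingWorkItemCount} " +
                        $"completed={ThreadPool.CompletedWorkItemCount}");
                    await Task.Delay(interval, cancellation);
                }
            }
            catch (OperationCanceledException)
            {
                // Host is shutting down; stop logging.
            }
        }, cancellation);
}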

@DocIT-Official
Author

.NET 6 (net6.0 and net6.0-windows). I'll start posting our startup code and deployment code.

@ReubenBond
Member

You can surround your code blocks with three backticks:

```
... code ...
```

@DocIT-Official
Author

DocIT-Official commented Sep 12, 2022

Our silo builder:

            if (o.EnableCluster || o.EnableSilo)
            {
                bool useKubeHosting = false;
                var clusterConfig = o.OrleansClusterConfiguration;
                webHostBuilder.UseOrleans(siloBuilder =>
                {
                    if (o.EnableCluster && o.EnableDevelopmentCluster == false)
                    {
#if(NET6_0_OR_GREATER)
                        if (!string.IsNullOrEmpty(System.Environment.GetEnvironmentVariable(KubernetesHostingOptions.PodNamespaceEnvironmentVariable)))
                        {
                            siloBuilder.UseKubernetesHosting();
                            useKubeHosting = true;
                        }
#endif
                        switch (clusterConfig.ConnectionConfig.AdoNetConstant.ToLower())
                        {
                            case "system.data.sqlclient":
                                siloBuilder.UseAdoNetClustering(options =>
                                {
                                    options.Invariant = clusterConfig.ConnectionConfig.AdoNetConstant;
                                    options.ConnectionString = clusterConfig.ConnectionConfig.ConnectionString;
                                })
                                .UseAdoNetReminderService(options =>
                                {
                                    options.Invariant = clusterConfig.ReminderConfigs[0].AdoNetConstant;
                                    options.ConnectionString = clusterConfig.ReminderConfigs[0].ConnectionString;
                                })
                                .AddAdoNetGrainStorage(clusterConfig.StorageConfigs[0].Name, options =>
                                {
                                    options.Invariant = clusterConfig.StorageConfigs[0].AdoNetConstant;
                                    options.ConnectionString = clusterConfig.StorageConfigs[0].ConnectionString;
                                });
                                break;

                            case "azurecosmostable":
                                siloBuilder.UseAzureStorageClustering(options =>
                                {
#if NETCOREAPP3_1
                                    options.ConnectionString = clusterConfig.ConnectionConfig.ConnectionString;
#else
                                    options.ConfigureTableServiceClient(clusterConfig.ConnectionConfig.ConnectionString);
#endif
                                })
                                .UseAzureTableReminderService(options =>
                                {
#if NETCOREAPP3_1
                                    options.ConnectionString = clusterConfig.ReminderConfigs[0].ConnectionString;
#else
                                    options.ConfigureTableServiceClient(clusterConfig.ReminderConfigs[0].ConnectionString);
#endif
                                })
                                .AddAzureTableGrainStorage(clusterConfig.StorageConfigs[0].Name, options =>
                                {
#if NETCOREAPP3_1
                                    options.ConnectionString = clusterConfig.StorageConfigs[0].ConnectionString;
#else
                                    options.ConfigureTableServiceClient(clusterConfig.StorageConfigs[0].ConnectionString);
#endif
                                });
                                break;
                        }
                    }
                    else if(o.EnableDevelopmentCluster)
                    {
                        siloBuilder.UseDevelopmentClustering(options =>
                        {
                            var address =
                                clusterConfig.PrimarySiloAddress.Split(new[] { ':' }, StringSplitOptions.RemoveEmptyEntries);
                            options.PrimarySiloEndpoint = new IPEndPoint(IPAddress.Parse(address[0]), Convert.ToInt32(address[1]));
                        }).UseInMemoryReminderService()
                            .AddMemoryGrainStorage("GrainStorage");
                    }
                    siloBuilder
                        .ConfigureLogging((hostingContext, logging) =>
                        {
                            logging.AddConsole();
                            logging.AddDebug();
                            if (!string.IsNullOrEmpty(telemetryKey))
                            {
                                logging.AddApplicationInsights(telemetryKey);
                            }
                            logging.AddSerilog();
                        })
                        .Configure<ClusterOptions>(options =>
                        {
                            if (!useKubeHosting)
                            {
                                options.ClusterId = clusterConfig.ClusterOptions.ClusterId;
                                options.ServiceId = GenerateServiceId(clusterConfig.ClusterOptions.ServiceId);
                            }
                        });

                    if (o.OrleansClusterConfiguration.EndPointOptions.AdvertisedIPAddress?.GetAddressBytes()?.Length > 0)
                    {
                        siloBuilder.ConfigureEndpoints(o.OrleansClusterConfiguration.EndPointOptions.AdvertisedIPAddress, GenerateSiloPortNumber(clusterConfig.EndPointOptions.SiloPort), GenerateGatewayPortNumber(clusterConfig.EndPointOptions.GatewayPort));
                    }
                    else
                    {
                        siloBuilder.ConfigureEndpoints(GenerateSiloPortNumber(clusterConfig.EndPointOptions.SiloPort), GenerateGatewayPortNumber(clusterConfig.EndPointOptions.GatewayPort));
                    }
                    if (Environment.OSVersion.Platform == PlatformID.Win32NT)
                    {
                        siloBuilder.UsePerfCounterEnvironmentStatistics();
                    }
                    else
                    {
                        siloBuilder.UseLinuxEnvironmentStatistics();
                    }
                    siloBuilder.Configure<ClusterMembershipOptions>(options =>
                    {
                        options.ExtendProbeTimeoutDuringDegradation = true;
                        options.EnableIndirectProbes = true;
                    });
                    siloBuilder.Configure<SiloMessagingOptions>(options =>
                    {
                        options.ResponseTimeout = TimeSpan.FromMinutes(30);
                        options.SystemResponseTimeout = TimeSpan.FromMinutes(30);
                    });
                    if (o.GrainAssemblies != null)
                    {
                        o.GrainAssemblies.BindConfiguration(config);
                        siloBuilder.ConfigureApplicationParts(o.GrainAssemblies.DefineApplicationParts);

                    }
                    siloBuilder.Configure<SerializationProviderOptions>(options =>
                    {
                        options.SerializationProviders.Add(typeof(Orleans.Serialization.ProtobufSerializer));
                    });
#if(NET5_0_OR_GREATER)
                    if (o.OrleansClusterConfiguration?.EnableDashboard == true)
                    {
                        siloBuilder.UseDashboard(dashboardOptions =>
                        {
                            dashboardOptions.HostSelf = false;
                            dashboardOptions.HideTrace = true;
                        });
                        siloBuilder.UseDashboardEmbeddedFiles();
                    }
#endif
                });
            }

@DocIT-Official
Author

Deployment

    apiVersion: apps/v1
    kind: Deployment
    metadata:
        name: abc
        labels:
            app: abc
    spec:
        replicas: 1
        selector:
            matchLabels:
                app: abc
        template:
            metadata:
                labels:
                    app: abc
                    orleans/clusterId: $(ClusterId)
                    orleans/serviceId: Document
                    orleans/clusterRole: Document
                    cloudregion: $(CloudRegion)
            spec:
                containers:
                    - name: firmclusteragent
                      image: xxx.azurecr.io/xxxx:$(Build.BuildNumber)
                      imagePullPolicy:
                      ports:
                        - name: silo
                          containerPort: $(GlobalSiloPort)
                          protocol: TCP
                        - name: gateway
                          containerPort: $(GlobalGatewayPort)
                          protocol: TCP
                      env:
                      - name: ORLEANS_SERVICE_ID
                        valueFrom:
                          fieldRef:
                            fieldPath: metadata.labels['app']
                      - name: ClusterConfig__ClusterOptions__ServiceId
                        valueFrom:
                          fieldRef:
                            fieldPath: metadata.labels['app']
                      - name: ORLEANS_CLUSTER_ID
                        valueFrom:
                          fieldRef:
                            fieldPath: metadata.labels['orleans/clusterId']
                      - name: ClusterConfig__ClusterOptions__ClusterId
                        valueFrom:
                          fieldRef:
                            fieldPath: metadata.labels['orleans/clusterId']
                      - name: ClusterConfig__ClusterOptions__ServiceId
                        valueFrom:
                          fieldRef:
                            fieldPath: metadata.labels['orleans/serviceId']
                      - name: DocIT__ClusterRole
                        valueFrom:
                          fieldRef:
                            fieldPath: metadata.labels['orleans/clusterRole']
                      - name: DocIT__CloudRegionCode
                        valueFrom:
                          fieldRef:
                            fieldPath: metadata.labels['cloudregion']
                      - name: POD_NAMESPACE
                        valueFrom:
                          fieldRef:
                            fieldPath: metadata.namespace
                      - name: POD_NAME
                        valueFrom:
                          fieldRef:
                            fieldPath: metadata.name
                      - name: POD_IP
                        valueFrom:
                          fieldRef:
                            fieldPath: status.podIP
                      - name: DOTNET_SHUTDOWNTIMEOUTSECONDS
                        value: "120"
                      request:
                terminationGracePeriodSeconds: 180
                nodeSelector:
                  beta.kubernetes.io/os: windows

@ReubenBond
Member

You said you have multiple silo processes in each container, but you're also using the Kubernetes hosting package, is that right? I wonder how that would work: it's not the scenario that package is designed for (one silo per pod).

@ReubenBond
Member

These timeout lengths are concerning. What prompted such long timeouts?

siloBuilder.Configure<SiloMessagingOptions>(options =>
{
  options.ResponseTimeout = TimeSpan.FromMinutes(30);
  options.SystemResponseTimeout = TimeSpan.FromMinutes(30);
});

I don't suppose you're able to profile and share traces of your pods while they are running, e.g. by collecting traces using dotnet-trace.
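
For reference, a typical way to capture such a trace with the dotnet-trace tool (standard tool usage; the process id is whatever dotnet-trace ps reports for the silo inside the pod):

# Run inside the pod, or a debug sidecar sharing its process namespace.
dotnet tool install --global dotnet-trace
dotnet-trace ps                                    # find the silo's process id
dotnet-trace collect --process-id <pid> --profile cpu-sampling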

@DocIT-Official
Author

I just added the Kubernetes hosting package to see if it would help. Right now I'm grasping at straws, because this behavior came out of nowhere and we can't roll back to a previous container, as this update required backend database changes. That said, let me explain our setup a little more clearly.

Our product is designed to run on client-managed hardware as Windows Services or in AKS, so we manipulate startup and load grain assemblies based on configuration:

                    if (o.GrainAssemblies != null)
                    {
                        o.GrainAssemblies.BindConfiguration(config); //pass in IConfiguration
                        siloBuilder.ConfigureApplicationParts(o.GrainAssemblies.DefineApplicationParts);

                    }

With this in place, we can deploy a single .exe to our clients that loads all the grains in all assemblies, or, in AKS, take the same container and split work across 3 deployments. A sketch of the assembly-loading side is below.
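
For readers, BindConfiguration and DefineApplicationParts above are the poster's own helpers; a hypothetical equivalent built on the Orleans 3.x application-parts API might look roughly like this (the options type and property names here are invented; the assembly names would be bound from IConfiguration):

using System;
using System.Reflection;
using Orleans;
using Orleans.ApplicationParts;

public sealed class GrainAssemblyOptions
{
    // Bound from IConfiguration, e.g. a list of grain assembly names per deployment.
    public string[] AssemblyNames { get; set; } = Array.Empty<string>();

    // Registers only the configured grain assemblies, so the same binary can
    // host a different grain set in each AKS deployment.
    public void DefineApplicationParts(IApplicationPartManager parts)
    {
        foreach (var name in AssemblyNames)
        {
            parts.AddApplicationPart(Assembly.Load(name)).WithReferences();
        }
    }
}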

@ReubenBond
Member

I'm not seeing anything in the diff between 3.5.0 and 3.6.2 which might have caused this. Please try the aforementioned configuration parameters as an attempt to prevent this and give you some respite while we work together to investigate the root cause (which is most likely caused by some kind of thread pool starvation, if I were to guess, but we cannot conclude that without seeing CPU profiling traces or log messages).

Please also remove the Kubernetes integration for now, since it's not suitable for this use case.

@DocIT-Official
Author

These timeout lengths are concerning. What prompted such long timeouts?

siloBuilder.Configure<SiloMessagingOptions>(options =>
{
  options.ResponseTimeout = TimeSpan.FromMinutes(30);
  options.SystemResponseTimeout = TimeSpan.FromMinutes(30);
});

I don't suppose you're able to profile and share traces of your pods while they are running, e.g. by collecting traces using dotnet-trace.

We have a function that aggregates tens of thousands of documents (PDF, Word, Excel, etc.) and combines them into a single document. When we rolled out this feature and our clients started using it, we started noticing message response timeouts, as some of these jobs would take 20-30 minutes. Extending the timeout resolved that issue. We have since moved on and implemented Orleans.SyncWorkers, so we could perhaps bring the timeouts back down.
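
As a side note for readers: long CPU-bound grain calls like this can also starve the threads a silo needs to answer membership probes. One common mitigation (and roughly the idea behind dedicated sync workers) is to push the heavy work onto its own thread and keep the grain's turn short. A minimal sketch, where DocumentMergeGrain and MergeDocuments are hypothetical stand-ins for the real work:

using System.Threading;
using System.Threading.Tasks;

public class DocumentMergeGrain // sketch; not the poster's actual grain
{
    public Task<byte[]> MergeAsync(byte[][] documents) =>
        // LongRunning requests a dedicated thread rather than a ThreadPool
        // thread, and TaskScheduler.Default keeps the work off the Orleans
        // task scheduler, so the silo can keep answering probes while the
        // merge runs.
        Task.Factory.StartNew(
            () => MergeDocuments(documents),   // hypothetical CPU-bound work
            CancellationToken.None,
            TaskCreationOptions.LongRunning,
            TaskScheduler.Default);

    private static byte[] MergeDocuments(byte[][] documents) =>
        throw new System.NotImplementedException(); // stand-in for the real merge
}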

@ReubenBond
Member

we have since moved on and implemented Orleans.SyncWorkers so we could perhaps bring the timeouts back down.

Great! By the way, what timezone are you in and would you prefer to diagnose this on a call?

@DocIT-Official
Author

I'm not seeing anything in the diff between 3.5.0 and 3.6.2 which might have caused this. Please try the aforementioned configuration parameters as an attempt to prevent this and give you some respite while we work together to investigate the root cause (which is most likely caused by some kind of thread pool starvation, if I were to guess, but we cannot conclude that without seeing CPU profiling traces or log messages).

Please also remove the Kubernetes integration for now, since it's not suitable for this use case.

I'll remove the Kubernetes integration.

@DocIT-Official
Author

we have since moved on and implemented Orleans.SyncWorkers so we could perhaps bring the timeouts back down.

Great! By the way, what timezone are you in and would you prefer to diagnose this on a call?

Eastern time zone. I can send a Teams meeting.

@DocIT-Official
Author

we have since moved on and implemented Orleans.SyncWorkers so we could perhaps bring the timeouts back down.

Great! By the way, what timezone are you in and would you prefer to diagnose this on a call?

What's the best way for me to send you a Teams meeting request? I really appreciate you lending a helping hand.

@ReubenBond
Member

rebond is my alias
microsoft.com is the domain

@DocIT-Official
Author

rebond is my alias microsoft.com is the domain

Invite sent; I hope it works for you.

@rafikiassumani-msft added the area-silos label and removed the Needs: triage 🔍 label Sep 15, 2022
@ReubenBond
Member

@DocIT-Official let's try again when you're available

@NQ-Brewir

Hello,
we started having this problem when we migrated from 3.5.0 to 3.6.2: dotnet/runtime#72365
At first, since we migrated to arm64 at the same time, we suspected an ARM issue, but it might be worth looking at both issues together, as the symptoms are quite similar.

@github-actions bot locked and limited conversation to collaborators Jan 10, 2024