
[Lease.Azure] Unexpected cluster instability #1251

Closed
Arkatufus opened this issue Jan 26, 2023 · 52 comments · Fixed by #1310

@Arkatufus

Version Information
Version of Akka.NET? 1.4.48
Which Akka.NET Modules? Akka.Coordination.Azure 1.0.0-beta2

Describe the bug

After upgrading to these module versions:
[screenshot: Versions]

Sharded cluster was unstable because the lease state was corrupted:
[screenshot: Exceptions]

Expected behavior
Sharded cluster should be stable and the lease should work as intended.

Actual behavior
Sharded cluster was unstable.

@wesselkranenborg

wesselkranenborg commented Jan 27, 2023

Could the issue be that I'm still using <PackageReference Include="Akka.Management.Cluster.Bootstrap" Version="0.3.0-beta4" />? I see that there is no 1.0.0 version of that package, but there is now an Akka.Management 1.0.0.

I'm now changing that to see if that changes the behavior.

@Arkatufus

That could be the problem, yes. Please let us know if that fixes it.

@wesselkranenborg

wesselkranenborg commented Jan 30, 2023

My first impression is that the cluster is stable for now (but back then the issues only occurred after x hours), yet there are still errors in the logs. They are errors concerning this blob file:

https://.blob.core.windows.net/akka-coordination-lease/system-state-service-singleton-akkasystem-state-servicesystemsh

[screenshot]

The errors are also pretty consistent; these are the failed dependency calls from the last 12 hours:
[screenshot]

This is a stack trace shown in our Application Insights:

Azure.RequestFailedException: The condition specified using HTTP conditional header(s) is not met.
RequestId:ff493f7a-101e-0058-1d7a-343ff6000000
Time:2023-01-30T07:13:37.6149868Z
Status: 412 (The condition specified using HTTP conditional header(s) is not met.)
ErrorCode: ConditionNotMet

Content:
<?xml version="1.0" encoding="utf-8"?><Error><Code>ConditionNotMet</Code><Message>The condition specified using HTTP conditional header(s) is not met.
RequestId:ff493f7a-101e-0058-1d7a-343ff6000000
Time:2023-01-30T07:13:37.6149868Z</Message></Error>

Headers:
Server: Windows-Azure-Blob/1.0,Microsoft-HTTPAPI/2.0
x-ms-request-id: ff493f7a-101e-0058-1d7a-343ff6000000
x-ms-client-request-id: 6b6dc170-e646-4716-a97f-f220b550602e
x-ms-version: 2021-10-04
x-ms-error-code: ConditionNotMet
Date: Mon, 30 Jan 2023 07:13:36 GMT
Content-Length: 252
Content-Type: application/xml

   at Azure.Storage.Blobs.BlockBlobRestClient.UploadAsync(Int64 contentLength, Stream body, Nullable`1 timeout, Byte[] transactionalContentMD5, String blobContentType, String blobContentEncoding, String blobContentLanguage, Byte[] blobContentMD5, String blobCacheControl, IDictionary`2 metadata, String leaseId, String blobContentDisposition, String encryptionKey, String encryptionKeySha256, Nullable`1 encryptionAlgorithm, String encryptionScope, Nullable`1 tier, Nullable`1 ifModifiedSince, Nullable`1 ifUnmodifiedSince, String ifMatch, String ifNoneMatch, String ifTags, String blobTagsString, Nullable`1 immutabilityPolicyExpiry, Nullable`1 immutabilityPolicyMode, Nullable`1 legalHold, Byte[] transactionalContentCrc64, CancellationToken cancellationToken)
   at Azure.Storage.Blobs.Specialized.BlockBlobClient.UploadInternal(Stream content, BlobHttpHeaders blobHttpHeaders, IDictionary`2 metadata, IDictionary`2 tags, BlobRequestConditions conditions, Nullable`1 accessTier, BlobImmutabilityPolicy immutabilityPolicy, Nullable`1 legalHold, IProgress`1 progressHandler, UploadTransferValidationOptions transferValidationOverride, String operationName, Boolean async, CancellationToken cancellationToken)

These errors started happening after the upgrade. In Akka.Coordination.Azure 0.3.0-beta4 (which we still use in production) these errors do not show up in the logs.

@Arkatufus

We'll look into this, thanks!

@Arkatufus

@wesselkranenborg Updated #1256 to fix this; it appears the Azure driver adds implicit preconditions even when we did not specify any.
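(To illustrate where a 412 ConditionNotMet comes from, here is a minimal, hypothetical sketch using Azure.Storage.Blobs 12.x; it is not the Akka.Coordination.Azure driver code, just the difference between an unconditional overwrite and an ETag-conditional one:)

using System.IO;
using System.Text;
using System.Threading.Tasks;
using Azure;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

// Hypothetical illustration, not the Akka.Coordination.Azure driver code.
// An empty BlobRequestConditions sends no If-Match/If-None-Match headers, so the
// PUT overwrites the blob unconditionally; an IfMatch condition against a stale
// ETag is rejected by the service with 412 ConditionNotMet, the exception above.
public static class LeaseBlobSketch
{
    public static Task WriteUnconditionallyAsync(BlobClient blob, string content) =>
        blob.UploadAsync(
            new MemoryStream(Encoding.UTF8.GetBytes(content)),
            new BlobUploadOptions { Conditions = new BlobRequestConditions() });

    public static Task WriteIfUnchangedAsync(BlobClient blob, string content, ETag expectedETag) =>
        blob.UploadAsync(
            new MemoryStream(Encoding.UTF8.GetBytes(content)),
            new BlobUploadOptions { Conditions = new BlobRequestConditions { IfMatch = expectedETag } });
}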

@wesselkranenborg

Good find! So basically it was only noise in the system, which is gone in version 1.0.1?

@Arkatufus

It's worse than noise, I think. Since it is leaking exceptions, it might have unforeseen consequences.

@wesselkranenborg

After investigation it was indeed worse than noise. It made our cluster completely unstable; now the exceptions are gone and data processing is going well again. On our integration environment you can clearly see from the number of processed (outgoing) messages that it's stable again 😀.
[screenshot]

Thanks for solving this. We'll test it a bit more and then we can move towards production with it.

@wesselkranenborg

wesselkranenborg commented Feb 1, 2023

During deployments, when a rebalance of the cluster happens, we still get the following exceptions, which we cannot explain:

[screenshot]

During this period we also see Akka dead letters in our logs for these shards.

@Arkatufus

@wesselkranenborg Can I have a list of the versions of all Akka-related NuGet packages used in that build?

@wesselkranenborg

Sure, here you go:

    <PackageReference Include="Akka" Version="1.4.49" />
    <PackageReference Include="Akka.Cluster" Version="1.4.49" />
    <PackageReference Include="Akka.Cluster.Hosting" Version="1.0.2" />
    <PackageReference Include="Akka.Cluster.Sharding" Version="1.4.49" />
    <PackageReference Include="Akka.Coordination.Azure" Version="1.0.1" />
    <PackageReference Include="Akka.Discovery.Azure" Version="1.0.1" />
    <PackageReference Include="Akka.HealthCheck.Cluster" Version="1.0.0" />
    <PackageReference Include="Akka.HealthCheck.Hosting.Web" Version="1.0.0" />
    <PackageReference Include="Akka.HealthCheck.Persistence" Version="1.0.0" />
    <PackageReference Include="Akka.DependencyInjection" Version="1.4.49" />
    <PackageReference Include="Akka.Management" Version="1.0.1" />
    <PackageReference Include="Akka.Persistence.Azure" Version="0.9.2" />
    <PackageReference Include="Akka.Persistence.Azure.Hosting" Version="0.9.2" />
    <PackageReference Include="Akka.Serialization.Hyperion" Version="1.4.49" />
    <PackageReference Include="Petabridge.Cmd.Cluster" Version="1.2.2" />
    <PackageReference Include="Petabridge.Cmd.Cluster.Sharding" Version="1.2.2" />
    <PackageReference Include="Petabridge.Cmd.Remote" Version="1.2.2" />

@Arkatufus

@wesselkranenborg do you use the Azure lease specifically for sharding, or do you also use it for the split brain resolver?

@wesselkranenborg

@Arkatufus: to be precise, we use this configuration:
[screenshot]

@Arkatufus

@wesselkranenborg can you give this code a try?

if (akkaSettings.UseakkaCoordinationLease)
{
  // Single option instance, shared by every lease consumer below
  var leaseOptions = new AzureLeaseOptions
  {
    AzureCredential = credentials,
    ServiceEndpoint = storageSettings.BlobStorageEndpoint,
    ContainerName = "akka-coordination-lease",
    HeartbeatTimeout = TimeSpan.FromSeconds(45)
  };

  clusterOptions.SplitBrainResolver = new LeaseMajorityOption
  {
    LeaseImplementation = leaseOptions,
    LeaseName = $"{serviceName}-akka-sbr"
  };

  configurationBuilder
    .WithClustering(clusterOptions)
    .WithAzureLease(leaseOptions)
    .WithSingleton<SingletonActorKey>(
      singletonName: "mySingleton",
      actorProps: mySingletonActorProps,
      options: new ClusterSingletonOptions
      {
        LeaseImplementation = leaseOptions
      });
}

The main point is that all of the LeaseImplementation options point to the same option instance, and I think you can drop the custom HOCON at the end.

@wesselkranenborg

That's actually a good point; I see now that these two have completely different options:
[screenshot]

Deployment to our integration environment is active now. I'll report tomorrow on the results of the test suite and overnight stability.

@wesselkranenborg

wesselkranenborg commented Feb 2, 2023

After updating this, I still saw the following in the logs this morning. At that moment a rebalance was not even happening (we thought it was related to that):

[screenshot]

I also see these warning traces:
[screenshot]

[23/02/02-05:35:37.7415][akka.tcp://system-state-service@10.0.8.252:12552/user/AzureLease22][akka://system-state-service/user/AzureLease22][0048]: Lease system-state-service-shard-heartbeat-shard-248 requested by client system-state-service@10.0.8.252:12552 is already owned by client. Previous lease was not released due to ungraceful shutdown. Lease is still within timeout so granting immediately

So it looks like the new lease is granted, but there might be a hiccup during these periods, or am I wrong about that?

@wesselkranenborg

wesselkranenborg commented Feb 2, 2023

During the night it happened on more shards. We have 2 * 300 shards.
[screenshot]

Could it be that Azure Storage cannot update the leases fast enough, resulting in the timeout? Should we set the timeout higher?
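(If raising the timeouts does turn out to be the right knob, here is a minimal sketch of doing so on the shared lease option instance; the property names follow the AzureLeaseOption configuration shown later in this thread, while the namespace and the concrete values are assumptions, not recommendations:)

using System;
using Akka.Coordination.Azure; // assumed namespace for AzureLeaseOption

// Sketch only: raising the lease timeouts on the single, shared option instance.
// Property names follow the AzureLeaseOption usage shown later in this thread;
// the values below are placeholders, not recommendations.
var leaseOptions = new AzureLeaseOption
{
    AzureCredential = credentials,                         // as in the existing hosting code
    ServiceEndpoint = storageSettings.BlobStorageEndpoint, // as in the existing hosting code
    ContainerName = "akka-coordination-lease",
    ApiServiceRequestTimeout = TimeSpan.FromSeconds(15),   // per-request budget against Blob Storage
    LeaseOperationTimeout = TimeSpan.FromSeconds(30)       // overall acquire/release budget
};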

@Arkatufus

That is something I have no experience with, as everything was tested using Azurite.

One possible cause is that Windows limits the number of open ports available to a single process.
Assuming there are roughly 10 shards per node, that would mean 10 HTTP connections per node, which is not as bad as it seems. We would also need to add the number of Remoting connections for each of these nodes. Are there any unreachable nodes in the log when these lease outages are happening?

The other possible cause is slow and/or saturated Azure connectivity; this needs to be tested with a benchmark because it is a total unknown.

@wesselkranenborg

There are no unreachable nodes during these 'outages'. Maybe it is already fixed by the PR that just got merged; I'm happy to try that out and see if it fixes some of the issues in our cluster.

@Arkatufus

Are you running your cluster inside Kubernetes? If you are, a Kubernetes lease would make more sense because it would run on the local network.

@wesselkranenborg

No, in Azure Container Apps

@Arkatufus

Ok, we'll do another release so you can test the new changes.

@wesselkranenborg

@Arkatufus
After upgrading, I see an error that we have seen happen more and more often:

Rejected to persist event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardRegionRegistered] with sequence number [5835] for persistenceId [/system/sharding/heartbeat-shardCoordinator/singleton/coordinator] due to [0:The specified entity already exists.
RequestId:85f090e6-f002-0068-74fb-3bc317000000
Time:2023-02-08T20:28:39.0189905Z
 The index of the entity that caused the error can be found in FailedTransactionActionIndex.
Status: 409 (Conflict)
ErrorCode: EntityAlreadyExists

Additional Information:
FailedEntity: 0

Content:
{"odata.error":{"code":"EntityAlreadyExists","message":{"lang":"en-US","value":"0:The specified entity already exists.\nRequestId:85f090e6-f002-0068-74fb-3bc317000000\nTime:2023-02-08T20:28:39.0189905Z"}}}

Headers:
X-Content-Type-Options: REDACTED
Cache-Control: no-cache
DataServiceVersion: 3.0;
Content-Type: application/json;odata=minimalmetadata;streaming=true;charset=utf-8
].

I need to fix the cluster by deleting the journal/snapshots of the sharding coordinator. It might have nothing to do with the Lease issue (it might be more of an Akka.Persistence.Azure issue?), but the bug is popping up again after upgrading this package :-)

@wesselkranenborg

I'll delete those journal tables on our integration setup and see how the cluster forms then.

@wesselkranenborg

wesselkranenborg commented Feb 8, 2023

Hmm, a redeploy was still in progress while I was writing this up, and that resolved the conflict. I'll let it run for a night and see how it behaves.

@wesselkranenborg

wesselkranenborg commented Feb 9, 2023

@Arkatufus: most of the exceptions were gone during the night. It only seems that the SBR is not working correctly with Akka.Coordination.Lease. During the night I saw 3 clusters, each trying to form a cluster with one node (resulting in a lot of warnings that a lease could not be acquired because it was already taken, but no exceptions anymore). Cluster formation failed there because I had configured the number-of-contacts to be 3.

In my storage container I also don't see the -sbr lease. I configure my SBR in the following way.

Actual configuration of the SBR (in our build pipeline we run a production build):

var clusterOptions = new ClusterOptions
{
    Roles = new[] { "core" },
    LogInfo = false
};

// lease option instance, please reuse in all shards, singletons, sbr, etc. configuration.
var leaseOptions = new AzureLeaseOption
{
    AzureCredential = credentials,
    ServiceEndpoint = storageSettings.BlobStorageEndpoint,
    ContainerName = "akka-coordination-lease",
    ApiServiceRequestTimeout = 10.Seconds(),
    LeaseOperationTimeout = 25.Seconds()
};

if (builder.Environment.IsDevelopment())
{
    configurationBuilder
        .WithClustering(clusterOptions);
}
else
{
    clusterOptions.SplitBrainResolver = new LeaseMajorityOption
    {
        LeaseImplementation = leaseOptions,
        LeaseName = $"{serviceName}-akka-sbr",
        Role = "core"
    };

    configurationBuilder
        .WithClustering(clusterOptions)
        .WithAzureLease(leaseOptions);
}

Dump of the HOCON on startup of the app:

split-brain-resolver : {
        active-strategy : lease-majority
        stable-after : 20s
        down-all-when-unstable : on
        static-quorum : {
          quorum-size : undefined
          role : 
        }
        keep-majority : {
          role : 
        }
        keep-oldest : {
          down-if-alone : on
          role : 
        }
        lease-majority : {
          lease-implementation : akka.coordination.lease.azure
          lease-name : system-state-service-akka-sbr
          acquire-lease-delay-for-minority : 2s
          release-after : 40s
          role : core
        }
        keep-referee : {
          address : 
          down-all-if-less-than-nodes : 1
        }
      }

When searching in the log for SBR I find these two messages:

[23/02/09-05:37:24.3627][akka.tcp://system-state-service@10.0.8.189:12552/system/cluster/core/daemon/downingProvider][akka://system-state-service/system/cluster/core/daemon/downingProvider][0029]: This node is now the leader responsible for taking SBR decisions among the reachable nodes (more leaders may exist).
[23/02/09-05:37:23.9264][akka.tcp://system-state-service@10.0.8.192:12552/system/cluster/core/daemon/downingProvider][akka://system-state-service/system/cluster/core/daemon/downingProvider][0029]: This node is not the leader any more and not responsible for taking SBR decisions.

But in the storage account I only see leases for the shard regions, not the system-state-service-akka-sbr lease name (as configured in the config).

@wesselkranenborg

(Quoting my earlier comment above about the EntityAlreadyExists persistence failure and the need to delete the sharding coordinator journal/snapshots.)

This error is happening more and more right now, also in our production cluster. The result is that not all discovery points are available in table storage, and then the cluster starts to form separate clusters (split-brain kind of scenarios).

This then results in these kinds of exceptions in the log and shards being unavailable. A restart of the nodes might solve the issue (temporarily).

[23/02/09-15:40:02.7036][akka.tcp://system-state-service@10.0.8.6:12552/system/sharding/staticstate-shardCoordinator/singleton/coordinator][akka://system-state-service/system/sharding/staticstate-shardCoordinator/singleton/coordinator][0034]: Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [2001] for persistenceId [/system/sharding/staticstate-shardCoordinator/singleton/coordinator]

@wesselkranenborg

@Arkatufus might this also be related to this one? akkadotnet/Akka.Hosting#211

I've now disabled coordination.lease completely and we still see this behavior.

@Arkatufus

@wesselkranenborg I'll dedicate my time tomorrow to looking into this.

@wesselkranenborg

wesselkranenborg commented Feb 10, 2023

That would be great. I'm more and more convinced that it might be something in the discovery as well.

This is our current Azure table:
[screenshot]

While we have 3 running nodes forming a cluster:
[screenshot]

And this is our discovery setup:
[screenshot]

After a restart of the nodes, the table is populated again with the restarted nodes. But then they don't join the already existing cluster, because those records are gone from the table. Then multiple clusters form, with multiple shard-region singleton coordinators, and I can indeed imagine that journals already exist or that you get exceptions while reading them, leading to a shard region that doesn't start.

I've also tried the workaround for akkadotnet/Akka.Hosting#211, but that's not helping here.

@wesselkranenborg

wesselkranenborg commented Feb 10, 2023

I just did a restart of our cluster on INT; this is the result:
[screenshot]

[screenshot]

The cluster now also runs without exceptions in the log and all shards are up and running.
I'll check back in some time and post the same information (and which exceptions are thrown in the meantime).

@wesselkranenborg

Without any exceptions happening, this is now the result:
[screenshot]
As you can see, the internal IPs are still the same, but... the Discovery table is empty. If one of the nodes now gets a restart, it cannot discover the cluster again.
[screenshot]

@wesselkranenborg

This is the current state (without exceptions):
[screenshot]

[screenshot]

But the current replica count is 10, so 3 nodes are not in the cluster and not in the table; they are on their own.

@Arkatufus

It seems as if all the entries were pruned somehow... let me check the code

@Arkatufus

@wesselkranenborg can I see your Akka.Discovery.Azure hosting code and the relevant HOCON config dump?

@wesselkranenborg

wesselkranenborg commented Feb 10, 2023

Sure, this is what I can find in the logs (though I see this log message is being trimmed because of the length of the configuration):

    management : {
      http : {
        hostname : <hostname>
        port : 8558
        bind-hostname : 
        bind-port : 
        base-path : 
        routes : {
          cluster-bootstrap : "Akka.Management.Cluster.Bootstrap.ClusterBootstrapProvider, Akka.Management"
        }
        route-providers-read-only : true
      }
      cluster : {
        bootstrap : {
          new-cluster-enabled : on
          contact-point-discovery : {
            service-name : system-state-service
            port-name : 
            protocol : tcp
            service-namespace : <service-namespace>
            effective-name : <effective-name>
            discovery-method : akka.discovery
            stable-margin : 5s
            interval : 1s
            exponential-backoff-random-factor : 0.2
            exponential-backoff-max : 15s
            required-contact-point-nr : 1
            resolve-timeout : 3s
            contact-with-all-contact-points : true
          }
          contact-point : {
            fallback-port : <fallback-port>
            filter-on-fallback-port : true
            probing-failure-timeout : 3s
            probe-interval : 1s
            probe-interval-jitter : 0.2
          }
          join-decider : {
            class : "Akka.Management.Cluster.Bootstrap.LowestAddressJoinDecider, Akka.Management"
          }
        }
      }
    }

This is our hosting code:

var credentials = new DefaultAzureCredential();
var serviceName = "system-state-service";
const int managementPort = 18558;
var endpointAddress = builder.Environment.IsDevelopment() ? "localhost" : GetPublicIpAddress();

configurationBuilder
    .WithAkkaManagement(setup =>
    {
        setup.Http.HostName = endpointAddress;
        setup.Http.BindHostName = endpointAddress;
        setup.Http.Port = managementPort;
        setup.Http.BindPort = managementPort;
    })
    .WithClusterBootstrap(options =>
    {
        options.ContactPointDiscovery.ServiceName = serviceName;
        options.ContactPointDiscovery.RequiredContactPointsNr = 1;
    });

var storageSettings = serviceProvider.GetRequiredService<IOptions<StorageSettings>>().Value;
var azureDiscoveryTableName = "AkkaClusterMembersStateServiceCoreTable";
configurationBuilder
    .WithAzureDiscovery(setup =>
    {
        setup.HostName = endpointAddress;
        setup.Port = managementPort;
        setup.ServiceName = serviceName;
        setup.TableName = azureDiscoveryTableName;
        setup.AzureCredential = credentials;
        setup.AzureTableEndpoint = storageSettings.TableStorageEndpoint;
    });

@Arkatufus

Arkatufus commented Feb 10, 2023

Also, do you have debug logging enabled in your test application? If so, can you check whether something like this appears in the log?

[system-state-service] 1 row entries pruned:
    [[ClusterMember] ServiceName: ...

@wesselkranenborg

@Arkatufus No, we don't have that enabled yet, but we can try to reproduce it with debug enabled. Which flag do I need to enable?

@Arkatufus

It's the akka.loglevel = DEBUG flag.
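(A minimal sketch of setting that flag from the Akka.Hosting builder used elsewhere in this thread; it assumes Akka.Hosting's AddHocon helper, and the same effect can be had by putting akka.loglevel = DEBUG in any HOCON that is loaded at startup:)

using Akka.Hosting; // AddHocon / HoconAddMode

// Sketch only: enable debug logging so the AzureDiscoveryGuardian pruning
// messages become visible. Assumes the Akka.Hosting configurationBuilder
// shown elsewhere in this thread.
configurationBuilder.AddHocon("akka.loglevel = DEBUG", HoconAddMode.Prepend);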

@wesselkranenborg

wesselkranenborg commented Feb 12, 2023

@Arkatufus: Yesterday I enabled debug logging and I indeed now see this verbose message:

[23/02/12-09:39:28.5908][AzureDiscoveryGuardian (akka://system-state-service)][akka://system-state-service/deadLetters][0032]: [system-state-service] 1 row entries pruned:
[ClusterMember] ServiceName: system-state-service, Address: 10.0.8.17, Port: 18558] Host: null, Created: 02/12/2023 08:39:22, Last update: 02/12/2023 08:40:22

This is the current state of the cluster:
[screenshot]
This is despite the fact that 3 replicas are running (so we have 2 clusters here 'fighting' for the same shard regions with the same persistence configuration).

[screenshot]

@Arkatufus

I'm pretty sure that the entries were pruned because each system failed to update its entries; this is related to the timeout exceptions in the original post.

@wesselkranenborg

That's weird, because I don't see any timeout errors in the log. Besides that, I see that logging from the AzureDiscoveryGuardian suddenly stops, and an hour after that the row entries are pruned. See this screenshot.

Saturday night around 20:00 the deployment started and logging started to appear, but then it suddenly stops without any error.

[screenshot]

[screenshot]

Sunday around 9:30 I see that one node got restarted by the cluster and the same behavior started. Logging appeared, then trace logging suddenly stopped, and one hour later the rows were pruned (again without exceptions or errors in the logs).

I'll first try to clean up the cluster (as the journal/snapshots are now corrupted because of multiple clusters writing to the same storage) and rerun with higher timeouts. I doubt that will help, as the 3 records are pruned a little more than one hour after the deployment; we see exactly the same behavior happening on our dev cluster (the logs above are from our integration cluster, which has real devices connected and thus a bit more traffic than dev).

@wesselkranenborg

wesselkranenborg commented Feb 13, 2023

I doubt whether the lastupdate column is properly updated. At what interval does it get an update? I'll post some data about that here (my cluster has been manually recovered again and now works perfectly after pruning the sharding journal/snapshots and restarting the nodes).

This is my current discovery table. I'll post another one here in 15 minutes:
[screenshot]

@wesselkranenborg

After almost 15 minutes the lastupdate column is still not updated (this is the column that, according to the code, the pruning is based on):

[screenshot]

@wesselkranenborg

In the logs I also only see these log messages (which correspond to the timestamps in the tables above):

[screenshot]

Same with these log messages:
[screenshot]

@wesselkranenborg

@Arkatufus Maybe it's good to also add that during this timeframe I got these messages in the logs (which say something about actors being created successfully):
[screenshot]

And I didn't get any 'Failed to' messages in the logs (error logging from the HeartbeatActor):
[screenshot]

And exactly one hour after the records were added to the table, they indeed got pruned:

[23/02/13-06:59:50.5657][AzureDiscoveryGuardian (akka://system-state-service)][akka://system-state-service/deadLetters][0031]: [system-state-service] 3 row entries pruned:
[ClusterMember] ServiceName: system-state-service, Address: 10.0.8.24, Port: 18558] Host: null, Created: 02/13/2023 05:59:44, Last update: 02/13/2023 06:00:44
[ClusterMember] ServiceName: system-state-service, Address: 10.0.8.43, Port: 18558] Host: null, Created: 02/13/2023 06:01:00, Last update: 02/13/2023 06:02:01
[ClusterMember] ServiceName: system-state-service, Address: 10.0.8.45, Port: 18558] Host: null, Created: 02/13/2023 06:00:23, Last update: 02/13/2023 06:01:23

@Arkatufus

Thank you for a very detailed report; I think I can narrow down the bug now.

@wesselkranenborg

I have a gut feeling that the boolean _updating is not set back to false after the first execution in the HeartbeatActor, but I cannot pinpoint why.

@Arkatufus

Updates/heartbeats are supposed to happen at 1-minute intervals, while node pruning checks are done every hour. Nodes are pruned if they have not been updated (no heartbeat detected) over the past 5 minutes.
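(Expressed as a sketch of that prune rule, purely for illustration; this is not the actual Akka.Discovery.Azure code, and the intervals are the defaults described above:)

using System;
using System.Collections.Generic;
using System.Linq;

// Illustration of the prune rule described above, not the actual
// Akka.Discovery.Azure implementation: an hourly sweep removes any member row
// whose last heartbeat (LastUpdate) is older than the staleness threshold.
public sealed record ClusterMemberRow(string Address, int Port, DateTimeOffset LastUpdate);

public static class PruneSketch
{
    public static readonly TimeSpan HeartbeatInterval = TimeSpan.FromMinutes(1); // heartbeat cadence
    public static readonly TimeSpan PruneInterval     = TimeSpan.FromHours(1);   // sweep cadence
    public static readonly TimeSpan StaleThreshold    = TimeSpan.FromMinutes(5); // "no heartbeat" window

    // Rows that would be removed on the next hourly sweep.
    public static IReadOnlyList<ClusterMemberRow> FindStale(
        IEnumerable<ClusterMemberRow> rows, DateTimeOffset now) =>
        rows.Where(r => now - r.LastUpdate > StaleThreshold).ToList();
}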

@Arkatufus

So this is definitely a heartbeat problem: either the heartbeat actor failed to do its job properly, or the guardian actor failed to restart the heartbeat actor when it failed.

@Arkatufus

@wesselkranenborg I think I pinned down the bug. It was caused by assuming that PipeTo always returns a value, so some of the success calls were not processed correctly. We'll do a patch release to fix it.
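(To illustrate the class of bug being described, a hedged sketch, not the actual Akka.Discovery.Azure code, of an actor piping an async update back to itself; the point is that both the success and failure mappings must produce a message, otherwise a guard flag like the _updating field mentioned above never gets reset:)

using System;
using System.Threading.Tasks;
using Akka.Actor;

// Illustrative only. The success/failure mappings passed to PipeTo must both
// yield a message; if one path is missed, the _updating guard is never reset
// and every later Tick is silently skipped, so heartbeats stop.
public sealed class HeartbeatSketchActor : ReceiveActor
{
    public sealed class Tick { public static readonly Tick Instance = new(); }
    private sealed class UpdateSucceeded { public static readonly UpdateSucceeded Instance = new(); }
    private sealed class UpdateFailed { public UpdateFailed(Exception cause) => Cause = cause; public Exception Cause { get; } }

    private readonly Func<Task<bool>> _updateRowAsync; // hypothetical "update my table row" delegate
    private bool _updating;

    public HeartbeatSketchActor(Func<Task<bool>> updateRowAsync)
    {
        _updateRowAsync = updateRowAsync;

        Receive<Tick>(_ =>
        {
            if (_updating) return; // skip overlapping updates
            _updating = true;

            // Pipe the async result back to this actor, mapping both outcomes explicitly.
            _updateRowAsync().PipeTo(
                Self,
                success: _ => UpdateSucceeded.Instance,
                failure: ex => new UpdateFailed(ex));
        });

        Receive<UpdateSucceeded>(_ => _updating = false);                      // reset on success
        Receive<UpdateFailed>(f => { _updating = false; /* log f.Cause */ });  // reset on failure too
    }
}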
