
[Lease.Azure] Unexpected cluster instability #1251

Closed
Arkatufus opened this issue Jan 26, 2023 · 52 comments · Fixed by #1310

@Arkatufus

Version Information
Version of Akka.NET? 1.4.48
Which Akka.NET Modules? Akka.Coordination.Azure 1.0.0-beta2

Describe the bug

After upgrading to these module versions:
[screenshot: Versions]

Sharded cluster was unstable because the lease state was corrupted:
[screenshot: Exceptions]

Expected behavior
Sharded cluster should be stable and the lease should work as intended.

Actual behavior
Sharded cluster was unstable.

@wesselkranenborg

wesselkranenborg commented Jan 27, 2023

Could the issue be that I'm still using <PackageReference Include="Akka.Management.Cluster.Bootstrap" Version="0.3.0-beta4" />? I see that there is no 1.0.0 version of that package, but there is now an Akka.Management 1.0.0.

I'm now changing that to see if that changes the behavior.

@Arkatufus

That could be the problem, yes. Please let us know if that fixes it.

@wesselkranenborg

wesselkranenborg commented Jan 30, 2023

My first impression is that the cluster is stable for now (but back then the issues only occurred after x hours), yet there are still errors in the logs. They are errors concerning this blob file:

https://.blob.core.windows.net/akka-coordination-lease/system-state-service-singleton-akkasystem-state-servicesystemsh

[screenshot]

The errors are also pretty consistent; these are the failed dependency calls from the last 12 hours:
[screenshot]

This is a stack trace shown in our Application Insights:

Azure.RequestFailedException: The condition specified using HTTP conditional header(s) is not met.
RequestId:ff493f7a-101e-0058-1d7a-343ff6000000
Time:2023-01-30T07:13:37.6149868Z
Status: 412 (The condition specified using HTTP conditional header(s) is not met.)
ErrorCode: ConditionNotMet

Content:
<?xml version="1.0" encoding="utf-8"?><Error><Code>ConditionNotMet</Code><Message>The condition specified using HTTP conditional header(s) is not met.
RequestId:ff493f7a-101e-0058-1d7a-343ff6000000
Time:2023-01-30T07:13:37.6149868Z</Message></Error>

Headers:
Server: Windows-Azure-Blob/1.0,Microsoft-HTTPAPI/2.0
x-ms-request-id: ff493f7a-101e-0058-1d7a-343ff6000000
x-ms-client-request-id: 6b6dc170-e646-4716-a97f-f220b550602e
x-ms-version: 2021-10-04
x-ms-error-code: ConditionNotMet
Date: Mon, 30 Jan 2023 07:13:36 GMT
Content-Length: 252
Content-Type: application/xml

   at Azure.Storage.Blobs.BlockBlobRestClient.UploadAsync(Int64 contentLength, Stream body, Nullable`1 timeout, Byte[] transactionalContentMD5, String blobContentType, String blobContentEncoding, String blobContentLanguage, Byte[] blobContentMD5, String blobCacheControl, IDictionary`2 metadata, String leaseId, String blobContentDisposition, String encryptionKey, String encryptionKeySha256, Nullable`1 encryptionAlgorithm, String encryptionScope, Nullable`1 tier, Nullable`1 ifModifiedSince, Nullable`1 ifUnmodifiedSince, String ifMatch, String ifNoneMatch, String ifTags, String blobTagsString, Nullable`1 immutabilityPolicyExpiry, Nullable`1 immutabilityPolicyMode, Nullable`1 legalHold, Byte[] transactionalContentCrc64, CancellationToken cancellationToken)
   at Azure.Storage.Blobs.Specialized.BlockBlobClient.UploadInternal(Stream content, BlobHttpHeaders blobHttpHeaders, IDictionary`2 metadata, IDictionary`2 tags, BlobRequestConditions conditions, Nullable`1 accessTier, BlobImmutabilityPolicy immutabilityPolicy, Nullable`1 legalHold, IProgress`1 progressHandler, UploadTransferValidationOptions transferValidationOverride, String operationName, Boolean async, CancellationToken cancellationToken)

These errors started happening after the upgrade. In Akka.Coordination.Azure 0.3.0-beta4 (which we still use in production) these errors do not show up in the logs.

@Arkatufus

We'll look into this, thanks!

@Arkatufus

@wesselkranenborg Updated #1256 to fix this; it appears the Azure driver adds implicit preconditions even when we did not specify any.
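(To illustrate where a 412 ConditionNotMet comes from, here is a minimal, hypothetical sketch using Azure.Storage.Blobs 12.x; it is not the Akka.Coordination.Azure driver code, just the difference between an unconditional overwrite and an ETag-conditional one:)

using System.IO;
using System.Text;
using System.Threading.Tasks;
using Azure;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

// Hypothetical illustration, not the Akka.Coordination.Azure driver code.
// An empty BlobRequestConditions sends no If-Match/If-None-Match headers, so the
// PUT overwrites the blob unconditionally; an IfMatch condition against a stale
// ETag is rejected by the service with 412 ConditionNotMet, the exception above.
public static class LeaseBlobSketch
{
    public static Task WriteUnconditionallyAsync(BlobClient blob, string content) =>
        blob.UploadAsync(
            new MemoryStream(Encoding.UTF8.GetBytes(content)),
            new BlobUploadOptions { Conditions = new BlobRequestConditions() });

    public static Task WriteIfUnchangedAsync(BlobClient blob, string content, ETag expectedETag) =>
        blob.UploadAsync(
            new MemoryStream(Encoding.UTF8.GetBytes(content)),
            new BlobUploadOptions { Conditions = new BlobRequestConditions { IfMatch = expectedETag } });
}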

@wesselkranenborg

Good find! So basically it was only noise in the system, which is gone in version 1.0.1?

@Arkatufus

It's worse than noise, I think. Since it is leaking exceptions, it might have unforeseen consequences.

@wesselkranenborg

After investigation it was indeed worse than noise. It made our cluster completely unstable; now the exceptions are gone and data processing is going well again. On our integration environment you can clearly see from the number of processed (outgoing) messages that it's stable again 😀.
[screenshot]

Thanks for solving this. We'll test it a bit more and then we can move towards production with it.

@wesselkranenborg

wesselkranenborg commented Feb 1, 2023

During deployments, when a rebalance of the cluster happens, we still get the following exceptions, which we cannot explain:

[screenshot]

During this period we also see Akka dead letters in our logs for these shards.

@Arkatufus

@wesselkranenborg Can I have a list of the versions of all Akka-related NuGet packages used in that build?

@wesselkranenborg

Sure, here you go:

    <PackageReference Include="Akka" Version="1.4.49" />
    <PackageReference Include="Akka.Cluster" Version="1.4.49" />
    <PackageReference Include="Akka.Cluster.Hosting" Version="1.0.2" />
    <PackageReference Include="Akka.Cluster.Sharding" Version="1.4.49" />
    <PackageReference Include="Akka.Coordination.Azure" Version="1.0.1" />
    <PackageReference Include="Akka.Discovery.Azure" Version="1.0.1" />
    <PackageReference Include="Akka.HealthCheck.Cluster" Version="1.0.0" />
    <PackageReference Include="Akka.HealthCheck.Hosting.Web" Version="1.0.0" />
    <PackageReference Include="Akka.HealthCheck.Persistence" Version="1.0.0" />
    <PackageReference Include="Akka.DependencyInjection" Version="1.4.49" />
    <PackageReference Include="Akka.Management" Version="1.0.1" />
    <PackageReference Include="Akka.Persistence.Azure" Version="0.9.2" />
    <PackageReference Include="Akka.Persistence.Azure.Hosting" Version="0.9.2" />
    <PackageReference Include="Akka.Serialization.Hyperion" Version="1.4.49" />
    <PackageReference Include="Petabridge.Cmd.Cluster" Version="1.2.2" />
    <PackageReference Include="Petabridge.Cmd.Cluster.Sharding" Version="1.2.2" />
    <PackageReference Include="Petabridge.Cmd.Remote" Version="1.2.2" />

@Arkatufus

@wesselkranenborg do you use the Azure lease specifically for sharding, or do you also use it for the split brain resolver?

@wesselkranenborg

@Arkatufus: to be precise, we use this configuration:
[screenshot]

@Arkatufus

@wesselkranenborg can you give this code a try?

if (akkaSettings.UseakkaCoordinationLease)
{
  // Single option instance, shared by every lease consumer below
  var leaseOptions = new AzureLeaseOptions
  {
    AzureCredential = credentials,
    ServiceEndpoint = storageSettings.BlobStorageEndpoint,
    ContainerName = "akka-coordination-lease",
    HeartbeatTimeout = TimeSpan.FromSeconds(45)
  };

  clusterOptions.SplitBrainResolver = new LeaseMajorityOption
  {
    LeaseImplementation = leaseOptions,
    LeaseName = $"{serviceName}-akka-sbr"
  };

  configurationBuilder
    .WithClustering(clusterOptions)
    .WithAzureLease(leaseOptions)
    .WithSingleton<SingletonActorKey>(
      singletonName: "mySingleton",
      actorProps: mySingletonActorProps,
      options: new ClusterSingletonOptions
      {
        LeaseImplementation = leaseOptions
      });
}

The main point is that all of the LeaseImplementation options point to the same option instance, and I think you can drop the custom HOCON at the end.

@wesselkranenborg

That's actually a good point; I see now that these two have completely different options:
[screenshot]

Deployment to our integration environment is active now. I'll report tomorrow on the results of the test suite and overnight stability.

@wesselkranenborg

wesselkranenborg commented Feb 2, 2023

After updating this, I still saw the following in the logs this morning. At that moment a rebalance was not even happening (we thought it was related to that):

[screenshot]

I also see these warning traces:
[screenshot]

[23/02/02-05:35:37.7415][akka.tcp://system-state-service@10.0.8.252:12552/user/AzureLease22][akka://system-state-service/user/AzureLease22][0048]: Lease system-state-service-shard-heartbeat-shard-248 requested by client system-state-service@10.0.8.252:12552 is already owned by client. Previous lease was not released due to ungraceful shutdown. Lease is still within timeout so granting immediately

So it looks like the new lease is granted, but there might be a hiccup during these periods, or am I wrong about that?

@wesselkranenborg

wesselkranenborg commented Feb 2, 2023

During the night it happened on more shards. We have 2 * 300 shards.
[screenshot]

Could it be that Azure Storage cannot update the leases fast enough, resulting in the timeout? Should we set the timeout higher?
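(If raising the timeouts does turn out to be the right knob, here is a minimal sketch of doing so on the shared lease option instance; the property names follow the AzureLeaseOption configuration shown later in this thread, while the namespace and the concrete values are assumptions, not recommendations:)

using System;
using Akka.Coordination.Azure; // assumed namespace for AzureLeaseOption

// Sketch only: raising the lease timeouts on the single, shared option instance.
// Property names follow the AzureLeaseOption usage shown later in this thread;
// the values below are placeholders, not recommendations.
var leaseOptions = new AzureLeaseOption
{
    AzureCredential = credentials,                         // as in the existing hosting code
    ServiceEndpoint = storageSettings.BlobStorageEndpoint, // as in the existing hosting code
    ContainerName = "akka-coordination-lease",
    ApiServiceRequestTimeout = TimeSpan.FromSeconds(15),   // per-request budget against Blob Storage
    LeaseOperationTimeout = TimeSpan.FromSeconds(30)       // overall acquire/release budget
};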

@Arkatufus

That is something I have no experience with, as everything was tested using Azurite.

One possible cause is that Windows limits the number of open ports available to a single process.
Assuming there are roughly 10 shards per node, that would mean 10 HTTP connections per node, which is not as bad as it seems. We would also need to add the number of Remoting connections for each of these nodes. Are there any unreachable nodes in the log when these lease outages are happening?

The other possible cause is slow and/or saturated Azure connectivity; this needs to be tested with a benchmark because it is a total unknown.

@wesselkranenborg

There are no unreachable nodes during these 'outages'. Maybe it is already fixed by the PR that just got merged; I'm happy to try that out and see if it fixes some of the issues in our cluster.

@Arkatufus

Are you running your cluster inside Kubernetes? If you are, a Kubernetes lease would make more sense because it would run on the local network.

@wesselkranenborg

No, in Azure Container Apps

@Arkatufus

Ok, we'll do another release so you can test the new changes.

@wesselkranenborg

@Arkatufus
After upgrading, I see an error that we have seen happen more and more often:

Rejected to persist event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardRegionRegistered] with sequence number [5835] for persistenceId [/system/sharding/heartbeat-shardCoordinator/singleton/coordinator] due to [0:The specified entity already exists.
RequestId:85f090e6-f002-0068-74fb-3bc317000000
Time:2023-02-08T20:28:39.0189905Z
 The index of the entity that caused the error can be found in FailedTransactionActionIndex.
Status: 409 (Conflict)
ErrorCode: EntityAlreadyExists

Additional Information:
FailedEntity: 0

Content:
{"odata.error":{"code":"EntityAlreadyExists","message":{"lang":"en-US","value":"0:The specified entity already exists.\nRequestId:85f090e6-f002-0068-74fb-3bc317000000\nTime:2023-02-08T20:28:39.0189905Z"}}}

Headers:
X-Content-Type-Options: REDACTED
Cache-Control: no-cache
DataServiceVersion: 3.0;
Content-Type: application/json;odata=minimalmetadata;streaming=true;charset=utf-8
].

I need to fix the cluster by deleting the journal/snapshots of the sharding coordinator. It might have nothing to do with the Lease issue (it might be more of an Akka.Persistence.Azure issue?), but the bug is popping up again after upgrading this package :-)

@wesselkranenborg

I'll delete those journal tables on our integration setup and see how the cluster forms then.

@wesselkranenborg

wesselkranenborg commented Feb 8, 2023

Hmm, a redeploy was still in progress while I was writing this up, and that resolved the conflict. I'll let it run for a night and see how it behaves.

@wesselkranenborg

wesselkranenborg commented Feb 9, 2023

@Arkatufus: most of the exceptions were gone during the night. It only seems that the SBR is not working correctly with Akka.Coordination.Lease. During the night I saw 3 clusters, each trying to form a cluster with one node (resulting in a lot of warnings that a lease could not be acquired because it was already taken, but no exceptions anymore). Cluster formation failed there because I had configured the number-of-contacts to be 3.

In my storage container I also don't see the -sbr lease. I configure my SBR in the following way.

Actual configuration of the SBR (in our build pipeline we run a production build):

var clusterOptions = new ClusterOptions
{
    Roles = new[] { "core" },
    LogInfo = false
};

// lease option instance, please reuse in all shards, singletons, sbr, etc. configuration.
var leaseOptions = new AzureLeaseOption
{
    AzureCredential = credentials,
    ServiceEndpoint = storageSettings.BlobStorageEndpoint,
    ContainerName = "akka-coordination-lease",
    ApiServiceRequestTimeout = 10.Seconds(),
    LeaseOperationTimeout = 25.Seconds()
};

if (builder.Environment.IsDevelopment())
{
    configurationBuilder
        .WithClustering(clusterOptions);
}
else
{
    clusterOptions.SplitBrainResolver = new LeaseMajorityOption
    {
        LeaseImplementation = leaseOptions,
        LeaseName = $"{serviceName}-akka-sbr",
        Role = "core"
    };

    configurationBuilder
        .WithClustering(clusterOptions)
        .WithAzureLease(leaseOptions);
}

Dump of the HOCON on startup of the app:

split-brain-resolver : {
        active-strategy : lease-majority
        stable-after : 20s
        down-all-when-unstable : on
        static-quorum : {
          quorum-size : undefined
          role : 
        }
        keep-majority : {
          role : 
        }
        keep-oldest : {
          down-if-alone : on
          role : 
        }
        lease-majority : {
          lease-implementation : akka.coordination.lease.azure
          lease-name : system-state-service-akka-sbr
          acquire-lease-delay-for-minority : 2s
          release-after : 40s
          role : core
        }
        keep-referee : {
          address : 
          down-all-if-less-than-nodes : 1
        }
      }

When searching in the log for SBR I find these two messages:

[23/02/09-05:37:24.3627][akka.tcp://system-state-service@10.0.8.189:12552/system/cluster/core/daemon/downingProvider][akka://system-state-service/system/cluster/core/daemon/downingProvider][0029]: This node is now the leader responsible for taking SBR decisions among the reachable nodes (more leaders may exist).
[23/02/09-05:37:23.9264][akka.tcp://system-state-service@10.0.8.192:12552/system/cluster/core/daemon/downingProvider][akka://system-state-service/system/cluster/core/daemon/downingProvider][0029]: This node is not the leader any more and not responsible for taking SBR decisions.

But in the storage account I only see leases for the shard regions, not the system-state-service-akka-sbr lease name (as configured in the config).

@wesselkranenborg

(Quoting my earlier comment above about the EntityAlreadyExists persistence failure and the need to delete the sharding coordinator journal/snapshots.)

This error is happening more and more right now, also in our production cluster. The result is that not all discovery points are available in table storage, and then the cluster starts to form separate clusters (split-brain kind of scenarios).

This then results in these kinds of exceptions in the log and shards being unavailable. A restart of the nodes might solve the issue (temporarily).

[23/02/09-15:40:02.7036][akka.tcp://system-state-service@10.0.8.6:12552/system/sharding/staticstate-shardCoordinator/singleton/coordinator][akka://system-state-service/system/sharding/staticstate-shardCoordinator/singleton/coordinator][0034]: Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [2001] for persistenceId [/system/sharding/staticstate-shardCoordinator/singleton/coordinator]

@wesselkranenborg

@Arkatufus might this also be related to this one? akkadotnet/Akka.Hosting#211

I've now disabled coordination.lease completely and we still see this behavior.

@Arkatufus

@wesselkranenborg I'll dedicate my time tomorrow to looking into this.

@wesselkranenborg

wesselkranenborg commented Feb 10, 2023

That would be great. I'm more and more convinced that it might be something in the discovery as well.

This is our current Azure table:
[screenshot]

While we have 3 running nodes forming a cluster:
[screenshot]

And this is our discovery setup:
[screenshot]

After a restart of the nodes, the table is populated again with the restarted nodes. But then they don't join the already existing cluster, because those records are gone from the table. Then multiple clusters form, with multiple shard-region singleton coordinators, and I can indeed imagine that journals already exist or that you get exceptions while reading them, leading to a shard region that doesn't start.

I've also tried the workaround for akkadotnet/Akka.Hosting#211, but that's not helping here.

@wesselkranenborg

wesselkranenborg commented Feb 10, 2023

I just did a restart of our cluster on INT; this is the result:
[screenshot]

[screenshot]

The cluster now also runs without exceptions in the log and all shards are up and running.
I'll check back in some time and post the same information (and which exceptions are thrown in the meantime).

@wesselkranenborg

Without any exceptions happening, this is now the result:
[screenshot]
As you can see, the internal IPs are still the same, but... the Discovery table is empty. If one of the nodes now gets a restart, it cannot discover the cluster again.
[screenshot]

@wesselkranenborg

This is the current state (without exceptions):
[screenshot]

[screenshot]

But the current replica count is 10, so 3 nodes are not in the cluster and not in the table; they are on their own.

@Arkatufus

It seems as if all the entries were pruned somehow... let me check the code

@Arkatufus

@wesselkranenborg can I see your Akka.Discovery.Azure hosting code and the relevant HOCON config dump?

@wesselkranenborg

wesselkranenborg commented Feb 10, 2023

Sure, this is what I can find in the logs (though I see this log message is being trimmed because of the length of the configuration):

    management : {
      http : {
        hostname : <hostname>
        port : 8558
        bind-hostname : 
        bind-port : 
        base-path : 
        routes : {
          cluster-bootstrap : "Akka.Management.Cluster.Bootstrap.ClusterBootstrapProvider, Akka.Management"
        }
        route-providers-read-only : true
      }
      cluster : {
        bootstrap : {
          new-cluster-enabled : on
          contact-point-discovery : {
            service-name : system-state-service
            port-name : 
            protocol : tcp
            service-namespace : <service-namespace>
            effective-name : <effective-name>
            discovery-method : akka.discovery
            stable-margin : 5s
            interval : 1s
            exponential-backoff-random-factor : 0.2
            exponential-backoff-max : 15s
            required-contact-point-nr : 1
            resolve-timeout : 3s
            contact-with-all-contact-points : true
          }
          contact-point : {
            fallback-port : <fallback-port>
            filter-on-fallback-port : true
            probing-failure-timeout : 3s
            probe-interval : 1s
            probe-interval-jitter : 0.2
          }
          join-decider : {
            class : "Akka.Management.Cluster.Bootstrap.LowestAddressJoinDecider, Akka.Management"
          }
        }
      }
    }

This is our hosting code:

var credentials = new DefaultAzureCredential();
var serviceName = "system-state-service";
const int managementPort = 18558;
var endpointAddress = builder.Environment.IsDevelopment() ? "localhost" : GetPublicIpAddress();

configurationBuilder
    .WithAkkaManagement(setup =>
    {
        setup.Http.HostName = endpointAddress;
        setup.Http.BindHostName = endpointAddress;
        setup.Http.Port = managementPort;
        setup.Http.BindPort = managementPort;
    })
    .WithClusterBootstrap(options =>
    {
        options.ContactPointDiscovery.ServiceName = serviceName;
        options.ContactPointDiscovery.RequiredContactPointsNr = 1;
    });

var storageSettings = serviceProvider.GetRequiredService<IOptions<StorageSettings>>().Value;
var azureDiscoveryTableName = "AkkaClusterMembersStateServiceCoreTable";
configurationBuilder
    .WithAzureDiscovery(setup =>
    {
        setup.HostName = endpointAddress;
        setup.Port = managementPort;
        setup.ServiceName = serviceName;
        setup.TableName = azureDiscoveryTableName;
        setup.AzureCredential = credentials;
        setup.AzureTableEndpoint = storageSettings.TableStorageEndpoint;
    });

@Arkatufus

Arkatufus commented Feb 10, 2023

Also, do you have debug logging enabled in your test application? If so, can you check whether something like this appears in the log?

[system-state-service] 1 row entries pruned:
    [[ClusterMember] ServiceName: ...

@wesselkranenborg

@Arkatufus No, we don't have that enabled yet, but we can try to reproduce it with debug enabled. Which flag do I need to enable?

@Arkatufus

It's the akka.loglevel = DEBUG flag.
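(A minimal sketch of setting that flag from the Akka.Hosting builder used elsewhere in this thread; it assumes Akka.Hosting's AddHocon helper, and the same effect can be had by putting akka.loglevel = DEBUG in any HOCON that is loaded at startup:)

using Akka.Hosting; // AddHocon / HoconAddMode

// Sketch only: enable debug logging so the AzureDiscoveryGuardian pruning
// messages become visible. Assumes the Akka.Hosting configurationBuilder
// shown elsewhere in this thread.
configurationBuilder.AddHocon("akka.loglevel = DEBUG", HoconAddMode.Prepend);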

@wesselkranenborg

wesselkranenborg commented Feb 12, 2023

@Arkatufus: Yesterday I enabled debug logging and I indeed now see this verbose message:

[23/02/12-09:39:28.5908][AzureDiscoveryGuardian (akka://system-state-service)][akka://system-state-service/deadLetters][0032]: [system-state-service] 1 row entries pruned:
[ClusterMember] ServiceName: system-state-service, Address: 10.0.8.17, Port: 18558] Host: null, Created: 02/12/2023 08:39:22, Last update: 02/12/2023 08:40:22

This is the current state of the cluster:
[screenshot]
This is despite the fact that 3 replicas are running (so we have 2 clusters here 'fighting' for the same shard regions with the same persistence configuration).

[screenshot]

@Arkatufus

I'm pretty sure that the entries were pruned because each system failed to update its entries; this is related to the timeout exceptions in the original post.

@wesselkranenborg

That's weird, because I don't see any timeout errors in the log. Besides that, I see that logging from the AzureDiscoveryGuardian suddenly stops, and an hour after that the row entries are pruned. See this screenshot.

Saturday night around 20:00 the deployment started and logging started to appear, but then it suddenly stops without any error.

[screenshot]

[screenshot]

Sunday around 9:30 I see that one node got restarted by the cluster and the same behavior started. Logging appeared, then trace logging suddenly stopped, and one hour later the rows were pruned (again without exceptions or errors in the logs).

I'll first try to clean up the cluster (as the journal/snapshots are now corrupted because of multiple clusters writing to the same storage) and rerun with higher timeouts. I doubt that will help, as the 3 records are pruned a little more than one hour after the deployment; we see exactly the same behavior happening on our dev cluster (the logs above are from our integration cluster, which has real devices connected and thus a bit more traffic than dev).

@wesselkranenborg

wesselkranenborg commented Feb 13, 2023

I doubt whether the lastupdate column is properly updated. At what interval does it get an update? I'll post some data about that here (my cluster has been manually recovered again and now works perfectly after pruning the sharding journal/snapshots and restarting the nodes).

This is my current discovery table. I'll post another one here in 15 minutes:
[screenshot]

@wesselkranenborg

After almost 15 minutes the lastupdate column is still not updated (this is the column that, according to the code, the pruning is based on):

[screenshot]

@wesselkranenborg

In the logs I also only see these log messages (which correspond to the timestamps in the tables above):

[screenshot]

Same with these log messages:
[screenshot]

@wesselkranenborg

@Arkatufus Maybe it's good to also add that during this timeframe I got these messages in the logs (which say something about actors being created successfully):
[screenshot]

And I didn't get any 'Failed to' messages in the logs (error logging from the HeartbeatActor):
[screenshot]

And exactly one hour after the records were added to the table, they indeed got pruned:

[23/02/13-06:59:50.5657][AzureDiscoveryGuardian (akka://system-state-service)][akka://system-state-service/deadLetters][0031]: [system-state-service] 3 row entries pruned:
[ClusterMember] ServiceName: system-state-service, Address: 10.0.8.24, Port: 18558] Host: null, Created: 02/13/2023 05:59:44, Last update: 02/13/2023 06:00:44
[ClusterMember] ServiceName: system-state-service, Address: 10.0.8.43, Port: 18558] Host: null, Created: 02/13/2023 06:01:00, Last update: 02/13/2023 06:02:01
[ClusterMember] ServiceName: system-state-service, Address: 10.0.8.45, Port: 18558] Host: null, Created: 02/13/2023 06:00:23, Last update: 02/13/2023 06:01:23

@Arkatufus

Thank you for a very detailed report; I think I can narrow down the bug now.

@wesselkranenborg

I have a gut feeling that the boolean _updating is not set back to false after the first execution in the HeartbeatActor, but I cannot pinpoint why.

@Arkatufus

Updates/heartbeats are supposed to happen at 1-minute intervals, while node pruning checks are done every hour. Nodes are pruned if they have not been updated (no heartbeat detected) over the past 5 minutes.
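(Expressed as a sketch of that prune rule, purely for illustration; this is not the actual Akka.Discovery.Azure code, and the intervals are the defaults described above:)

using System;
using System.Collections.Generic;
using System.Linq;

// Illustration of the prune rule described above, not the actual
// Akka.Discovery.Azure implementation: an hourly sweep removes any member row
// whose last heartbeat (LastUpdate) is older than the staleness threshold.
public sealed record ClusterMemberRow(string Address, int Port, DateTimeOffset LastUpdate);

public static class PruneSketch
{
    public static readonly TimeSpan HeartbeatInterval = TimeSpan.FromMinutes(1); // heartbeat cadence
    public static readonly TimeSpan PruneInterval     = TimeSpan.FromHours(1);   // sweep cadence
    public static readonly TimeSpan StaleThreshold    = TimeSpan.FromMinutes(5); // "no heartbeat" window

    // Rows that would be removed on the next hourly sweep.
    public static IReadOnlyList<ClusterMemberRow> FindStale(
        IEnumerable<ClusterMemberRow> rows, DateTimeOffset now) =>
        rows.Where(r => now - r.LastUpdate > StaleThreshold).ToList();
}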

@Arkatufus

So this is definitely a heartbeat problem: either the heartbeat actor failed to do its job properly, or the guardian actor failed to restart the heartbeat actor when it failed.

@Arkatufus

@wesselkranenborg I think I pinned down the bug. It was caused by assuming that PipeTo always returns a value, so some of the success calls were not processed correctly. We'll do a patch release to fix it.
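(To illustrate the class of bug being described, a hedged sketch, not the actual Akka.Discovery.Azure code, of an actor piping an async update back to itself; the point is that both the success and failure mappings must produce a message, otherwise a guard flag like the _updating field mentioned above never gets reset:)

using System;
using System.Threading.Tasks;
using Akka.Actor;

// Illustrative only. The success/failure mappings passed to PipeTo must both
// yield a message; if one path is missed, the _updating guard is never reset
// and every later Tick is silently skipped, so heartbeats stop.
public sealed class HeartbeatSketchActor : ReceiveActor
{
    public sealed class Tick { public static readonly Tick Instance = new(); }
    private sealed class UpdateSucceeded { public static readonly UpdateSucceeded Instance = new(); }
    private sealed class UpdateFailed { public UpdateFailed(Exception cause) => Cause = cause; public Exception Cause { get; } }

    private readonly Func<Task<bool>> _updateRowAsync; // hypothetical "update my table row" delegate
    private bool _updating;

    public HeartbeatSketchActor(Func<Task<bool>> updateRowAsync)
    {
        _updateRowAsync = updateRowAsync;

        Receive<Tick>(_ =>
        {
            if (_updating) return; // skip overlapping updates
            _updating = true;

            // Pipe the async result back to this actor, mapping both outcomes explicitly.
            _updateRowAsync().PipeTo(
                Self,
                success: _ => UpdateSucceeded.Instance,
                failure: ex => new UpdateFailed(ex));
        });

        Receive<UpdateSucceeded>(_ => _updating = false);                      // reset on success
        Receive<UpdateFailed>(f => { _updating = false; /* log f.Cause */ });  // reset on failure too
    }
}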
