
ArangoDB in Active Failover mode does not fail over after accidental data loss on the leader #16532

Open
GarretTheShadow opened this issue Jul 12, 2022 · 1 comment

Comments

@GarretTheShadow

My Environment

  • ArangoDB Version: 3.8.4, 3.9.1
  • Deployment Mode: Active Failover
  • Deployment Strategy: Kubernetes
  • Total RAM in your machine: 500 Mi per pod (test stand without any incoming connections)

Steps to reproduce

We tried to simulate data corruption on the leader instance.

  1. Deploy Active Failover to Kubernetes, create some databases and collections.
  2. Delete all the data from the leader's PV.
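For clarity, step 2 can be sketched as a small shell helper (the data directory path and pod name below are assumptions from our setup, not official defaults; this is destructive and meant for a test stand only):

```shell
# Wipe everything under the given data directory, simulating total
# data loss on the leader's PV. The ":?" guards refuse to run if the
# variable is empty, so we never expand to a bare "rm -rf /*".
wipe_leader_data() {
  data_dir="${1:?data directory required}"
  rm -rf "${data_dir:?}"/*   # delete the leader's on-disk data
}

# In our cluster this ran roughly as (pod name and path are examples):
#   kubectl exec main-arangodb-1-... -- sh -c 'rm -rf /var/lib/arangodb3/*'
```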

Problem:

error while running applier thread for global database: got invalid response from leader at tcp://main-arangodb-1.egkuarango.svc:8529 for URL /_api/wal/open-transactions?global=true&serverId=117008723892253&from=0&to=1205: HTTP 404: Not Found - {"code":404,"error":true,"errorMessage":"NotFound: ","errorNum":1202}

These errors repeat forever, until you restart the leader pod by hand.

Expected result:
We expected the agency to perform a failover and automatically resync data to the old leader, but in fact the follower logs only contain errors about missing WAL files.

Extra information:
We tried to automate recovery by adding a livenessProbe that checks for the existence of the VERSION file. But the result is awful if --agency.supervision-grace-period is too big (60 seconds in our case) and the pod manages to come back up before the grace period ends: the follower initiates a full resync and flushes all of its data to match the (now empty) current leader.

2022-07-05T20:01:38Z [1] INFO [04e4e] {heartbeat} Starting replication from tcp://main-arangodb-1.egkuarango.svc:8529
2022-07-05T20:01:38Z [1] INFO [ab4a2] {heartbeat} start initial sync from leader
2022-07-05T20:01:39Z [1] WARNING [2b48f] {httpclient} retrying failed HTTP request for endpoint 'tcp://main-arangodb-1.egkuarango.svc:8529' for replication applier - retries left: 0 - Could not connect to 'http+tcp://main-arangodb-1.egkuarango.svc:8529' 'connect() failed with #111 - Connection refused'
2022-07-05T20:01:40Z [1] WARNING [de0be] {httpclient} retrying failed HTTP request for endpoint 'tcp://main-arangodb-1.egkuarango.svc:8529' for replication applier - no retries left - Could not connect to 'http+tcp://main-arangodb-1.egkuarango.svc:8529' 'connect() failed with #111 - Connection refused'
2022-07-05T20:01:40Z [1] ERROR [6fe50] {replication} error while running applier thread for global database: could not connect to leader at tcp://main-arangodb-1.egkuarango.svc:8529 for URL /_api/replication/logger-state?serverId=117008723892253: Could not connect to 'http+tcp://main-arangodb-1.egkuarango.svc:8529' 'connect() failed with #111 - Connection refused'
2022-07-05T20:01:40Z [1] INFO [21c52] {replication} stopped replication applier for global database
2022-07-05T20:01:41Z [1] WARNING [66d82] {heartbeat} forgetting previous applier state. Will trigger a full resync now
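For reference, the probe we used is essentially the following check (this is our own script, not anything shipped with ArangoDB; the default data path is an assumption):

```shell
# Liveness check: fail when the VERSION file is gone, so Kubernetes
# restarts the pod instead of leaving it running on an empty data
# directory. Intended to be wired into livenessProbe.exec.command.
check_version_file() {
  data_dir="${1:-/var/lib/arangodb3}"   # assumed default data path
  if [ -f "$data_dir/VERSION" ]; then
    echo "ok"
    return 0
  fi
  echo "VERSION file missing in $data_dir" >&2
  return 1
}
```

As described above, restarting the pod this way only helps if the agency actually declares the leader failed before the pod comes back; otherwise the follower resyncs against the empty leader.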

P.S.
I tried to tune some settings: --agency.supervision-grace-period, and initialDelaySeconds for the liveness and readiness probes. In some setups we got a normal failover, but only when --agency.supervision-grace-period was below 20 seconds.
We cannot assume that 20 seconds is always enough. In some cases, to prevent false switchovers due to flaky connections or Kubernetes lag, we used 60 seconds.
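For context, this is roughly where the grace period is set in a kube-arangodb deployment spec (field names follow the ArangoDeployment CRD as we understand it; treat this as a sketch, not a verified manifest):

```yaml
apiVersion: database.arangodb.com/v1
kind: ArangoDeployment
metadata:
  name: main-arangodb
spec:
  mode: ActiveFailover
  agents:
    args:
      # 60s avoids false switchovers on flaky networks, but lets a
      # restarted (empty) leader come back before failover triggers.
      - --agency.supervision-grace-period=60
```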

@GarretTheShadow
Author

Hello, any help here?
Has anyone tried to test this scenario?
