
ArangoDB in Active Failover mode does not fail over after accidental data loss on the leader #16532

Open
GarretTheShadow opened this issue Jul 12, 2022 · 1 comment

Comments

@GarretTheShadow

My Environment

  • ArangoDB Version: 3.8.4, 3.9.1
  • Deployment Mode: Active Failover
  • Deployment Strategy: Kubernetes
  • Total RAM in your machine: 500 Mi per pod (test stand without any incoming connections)

Steps to reproduce

We tried to simulate data corruption on the leader instance.

  1. Deploy Active Failover to Kubernetes, create some databases and collections.
  2. Delete all the data from the leader's PV.
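For clarity, step 2 can be sketched as a small shell helper (the data directory path and pod name below are assumptions from our setup, not official defaults; this is destructive and meant for a test stand only):

```shell
# Wipe everything under the given data directory, simulating total
# data loss on the leader's PV. The ":?" guards refuse to run if the
# variable is empty, so we never expand to a bare "rm -rf /*".
wipe_leader_data() {
  data_dir="${1:?data directory required}"
  rm -rf "${data_dir:?}"/*   # delete the leader's on-disk data
}

# In our cluster this ran roughly as (pod name and path are examples):
#   kubectl exec main-arangodb-1-... -- sh -c 'rm -rf /var/lib/arangodb3/*'
```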

Problem:

error while running applier thread for global database: got invalid response from leader at tcp://main-arangodb-1.egkuarango.svc:8529 for URL /_api/wal/open-transactions?global=true&serverId=117008723892253&from=0&to=1205: HTTP 404: Not Found - {"code":404,"error":true,"errorMessage":"NotFound: ","errorNum":1202}

These errors repeat forever, until you restart the leader pod by hand.

Expected result:
We expected the agency to perform a failover and automatically resync data to the old leader, but in fact the follower logs only contain errors about missing WAL files.

Extra information:
We tried to automate recovery by adding a livenessProbe that checks for the existence of the VERSION file. But the result is awful if --agency.supervision-grace-period is too big (60 seconds in our case) and the pod manages to come back up before the grace period ends: the follower initiates a full resync and flushes all of its data to match the (now empty) current leader.

2022-07-05T20:01:38Z [1] INFO [04e4e] {heartbeat} Starting replication from tcp://main-arangodb-1.egkuarango.svc:8529
2022-07-05T20:01:38Z [1] INFO [ab4a2] {heartbeat} start initial sync from leader
2022-07-05T20:01:39Z [1] WARNING [2b48f] {httpclient} retrying failed HTTP request for endpoint 'tcp://main-arangodb-1.egkuarango.svc:8529' for replication applier - retries left: 0 - Could not connect to 'http+tcp://main-arangodb-1.egkuarango.svc:8529' 'connect() failed with #111 - Connection refused'
2022-07-05T20:01:40Z [1] WARNING [de0be] {httpclient} retrying failed HTTP request for endpoint 'tcp://main-arangodb-1.egkuarango.svc:8529' for replication applier - no retries left - Could not connect to 'http+tcp://main-arangodb-1.egkuarango.svc:8529' 'connect() failed with #111 - Connection refused'
2022-07-05T20:01:40Z [1] ERROR [6fe50] {replication} error while running applier thread for global database: could not connect to leader at tcp://main-arangodb-1.egkuarango.svc:8529 for URL /_api/replication/logger-state?serverId=117008723892253: Could not connect to 'http+tcp://main-arangodb-1.egkuarango.svc:8529' 'connect() failed with #111 - Connection refused'
2022-07-05T20:01:40Z [1] INFO [21c52] {replication} stopped replication applier for global database
2022-07-05T20:01:41Z [1] WARNING [66d82] {heartbeat} forgetting previous applier state. Will trigger a full resync now
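For reference, the probe we used is essentially the following check (this is our own script, not anything shipped with ArangoDB; the default data path is an assumption):

```shell
# Liveness check: fail when the VERSION file is gone, so Kubernetes
# restarts the pod instead of leaving it running on an empty data
# directory. Intended to be wired into livenessProbe.exec.command.
check_version_file() {
  data_dir="${1:-/var/lib/arangodb3}"   # assumed default data path
  if [ -f "$data_dir/VERSION" ]; then
    echo "ok"
    return 0
  fi
  echo "VERSION file missing in $data_dir" >&2
  return 1
}
```

As described above, restarting the pod this way only helps if the agency actually declares the leader failed before the pod comes back; otherwise the follower resyncs against the empty leader.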

P.S.
I tried to tune some settings: --agency.supervision-grace-period, and initialDelaySeconds for the liveness and readiness probes. In some setups we got a normal failover, but only when --agency.supervision-grace-period was below 20 seconds.
We cannot assume that 20 seconds is always enough. In some cases, to prevent false switchovers due to flaky connections or Kubernetes lag, we used 60 seconds.
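For context, this is roughly where the grace period is set in a kube-arangodb deployment spec (field names follow the ArangoDeployment CRD as we understand it; treat this as a sketch, not a verified manifest):

```yaml
apiVersion: database.arangodb.com/v1
kind: ArangoDeployment
metadata:
  name: main-arangodb
spec:
  mode: ActiveFailover
  agents:
    args:
      # 60s avoids false switchovers on flaky networks, but lets a
      # restarted (empty) leader come back before failover triggers.
      - --agency.supervision-grace-period=60
```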

@GarretTheShadow
Author

Hello, any help here?
Has anyone tried to test this scenario?
