My Environment
Total RAM in your machine: 500 MiB per pod (test stand without any incoming connections)
Steps to reproduce
We wanted to test the case of data corruption on the leader instance.
Deploy Active Failover to Kubernetes and create some databases and collections.
Then delete all data from the leader's PV.
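For context, the deployment step looks roughly like the sketch below, assuming the kube-arangodb operator is used; the name, namespace and counts are illustrative, not our exact manifest.

# Minimal ActiveFailover deployment sketch (illustrative values only).
apiVersion: "database.arangodb.com/v1"
kind: "ArangoDeployment"
metadata:
  name: main-arangodb      # illustrative, matching the hostnames seen in the logs
  namespace: egkuarango
spec:
  mode: ActiveFailover     # one leader plus follower, supervised by the agency
  single:
    count: 2               # leader + follower single servers
  agents:
    count: 3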
Problem:
error while running applier thread for global database: got invalid response from leader at tcp://main-arangodb-1.egkuarango.svc:8529 for URL /_api/wal/open-transactions?global=true&serverId=117008723892253&from=0&to=1205: HTTP 404: Not Found - {"code":404,"error":true,"errorMessage":"NotFound: ","errorNum":1202}
These errors repeat forever until you restart the leader pod by hand.
Expected result:
We expected the agency to perform a failover and automatically resync the data back to the old leader, but instead the follower logs contained only errors about missing WAL data.
Extra information:
We tried to automate this and added a livenessProbe that checks whether the VERSION file exists. But the result is awful if --agency.supervision-grace-period is too big (60 seconds in our case) and the pod manages to come back up before the grace period ends: the follower then initiates a full resync and flushes all of its data in order to sync with the current leader.
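For illustration, the probe we added looked roughly like the following; the data path and timing values here are assumptions and depend on the image and chart used.

livenessProbe:
  exec:
    # Sketch: consider the server dead if the VERSION file in the data
    # directory is gone (path assumes the official image default).
    command: ["test", "-f", "/var/lib/arangodb3/VERSION"]
  initialDelaySeconds: 15   # small enough that the restarted pod comes back
  periodSeconds: 10         # before --agency.supervision-grace-period (60 s) expires
  failureThreshold: 3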
2022-07-05T20:01:38Z [1] INFO [04e4e] {heartbeat} Starting replication from tcp://main-arangodb-1.egkuarango.svc:8529
2022-07-05T20:01:38Z [1] INFO [ab4a2] {heartbeat} start initial sync from leader
2022-07-05T20:01:39Z [1] WARNING [2b48f] {httpclient} retrying failed HTTP request for endpoint 'tcp://main-arangodb-1.egkuarango.svc:8529' for replication applier - retries left: 0 - Could not connect to 'http+tcp://main-arangodb-1.egkuarango.svc:8529' 'connect() failed with #111 - Connection refused'
2022-07-05T20:01:40Z [1] WARNING [de0be] {httpclient} retrying failed HTTP request for endpoint 'tcp://main-arangodb-1.egkuarango.svc:8529' for replication applier - no retries left - Could not connect to 'http+tcp://main-arangodb-1.egkuarango.svc:8529' 'connect() failed with #111 - Connection refused'
2022-07-05T20:01:40Z [1] ERROR [6fe50] {replication} error while running applier thread for global database: could not connect to leader at tcp://main-arangodb-1.egkuarango.svc:8529 for URL /_api/replication/logger-state?serverId=117008723892253: Could not connect to 'http+tcp://main-arangodb-1.egkuarango.svc:8529' 'connect() failed with #111 - Connection refused'
2022-07-05T20:01:40Z [1] INFO [21c52] {replication} stopped replication applier for global database
2022-07-05T20:01:41Z [1] WARNING [66d82] {heartbeat} forgetting previous applier state. Will trigger a full resync now
P.S.
I tried to tune some settings: --agency.supervision-grace-period, and initialDelaySeconds for the liveness and readiness probes. In some setups we got a normal failover, but only when --agency.supervision-grace-period was below 20 seconds.
We can't rely on 20 seconds being enough. In some cases, to prevent false switchovers due to flaky connections or Kubernetes lags, we used 60 seconds.
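For comparison, a sketch of the combination that did give us a clean failover in testing; the values come from our experiments, and where exactly the flag and the probe are set depends on how the agents and single servers are deployed.

# Agent option: only values below ~20 s produced a normal failover for us,
# but such a short grace period risks false switchovers on flaky networks.
args:
  - --agency.supervision-grace-period=20

# Single-server probes: we tuned initialDelaySeconds on both the liveness
# and the readiness probe alongside the grace period (liveness shown here).
livenessProbe:
  exec:
    command: ["test", "-f", "/var/lib/arangodb3/VERSION"]
  initialDelaySeconds: 30
  periodSeconds: 10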