Active failover fails to install foxx app #13915

Closed
rivet351 opened this issue Apr 6, 2021 · 10 comments

Comments

@rivet351

rivet351 commented Apr 6, 2021

My Environment

  • ArangoDB Version: 3.7.6
  • Storage Engine: RocksDB
  • Deployment Mode: Active Failover
  • Deployment Strategy: Kubernetes
  • Configuration: Azure Kubernetes with load balancer
  • Infrastructure: Azure
  • Operating System: Azure Kubernetes with docker
  • Total RAM in your machine: 32 GB
  • Disks in use: SSD
  • Used Package: Docker - official Docker library

Component, Query & Data

Affected feature:
Installation of Foxx app following failover event

Replication Factor & Number of Shards (Cluster only):
Leader with 2 x Followers

Steps to reproduce

  1. Kill leader to trigger failover
  2. Inspect the UI (the Foxx app appears intermittently: sometimes it is shown, sometimes not, changing on refresh)
  3. Error seen in UI and kubernetes logs:

2021-04-06T15:18:32Z [1] ERROR [24213] Failed to load Foxx service mounted at "/mailapi"
2021-04-06T15:18:32Z [1] ERROR [24213] via ArangoError: service files missing
2021-04-06T15:18:32Z [1] ERROR [24213] Mount: /mailapi
2021-04-06T15:18:32Z [1] ERROR [24213] at loadInstalledService (/usr/share/arangodb3/js/server/modules/@arangodb/foxx/manager.js:616:13)
2021-04-06T15:18:32Z [1] ERROR [24213] at initLocalServiceMap (/usr/share/arangodb3/js/server/modules/@arangodb/foxx/manager.js:519:23)
2021-04-06T15:18:32Z [1] ERROR [24213] at selfHeal (/usr/share/arangodb3/js/server/modules/@arangodb/foxx/manager.js:245:5)
2021-04-06T15:18:32Z [1] ERROR [24213] at Object.selfHealAll [as healAll] (/usr/share/arangodb3/js/server/modules/@arangodb/foxx/manager.js:196:20)
2021-04-06T15:18:32Z [1] ERROR [24213] at Object.exports.manage (/usr/share/arangodb3/js/server/modules/@arangodb/foxx/queues/manager.js:234:19)
2021-04-06T15:18:32Z [1] ERROR [24213] at eval (eval at (unknown source), :2:50)
2021-04-06T15:18:32Z [1] ERROR [24213] at eval (eval at (unknown source), :3:9)
2021-04-06T15:18:32Z [1] ERROR [24213] at eval (eval at (unknown source), :3:21)
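
When logs like the above are collected from several pods, it can help to reduce them to "which mount failed, and why". The following is a minimal Python sketch, not an ArangoDB tool: the function name is hypothetical, and it assumes the line order shown above (the "Failed to load" line first, followed by its "via ArangoError" line).

```python
import re
from collections import Counter

# Patterns taken from the log lines quoted in this issue.
FAILED_RE = re.compile(r'Failed to load Foxx service mounted at "([^"]+)"')
REASON_RE = re.compile(r'via ArangoError: (.+?)\s*$')


def summarize_foxx_errors(lines):
    """Count (mount, reason) pairs in chronologically ordered log lines.

    Assumption: each "Failed to load" line is followed by the matching
    "via ArangoError" line, as in the log excerpt above.
    """
    counts = Counter()
    mount = None
    for line in lines:
        failed = FAILED_RE.search(line)
        if failed:
            mount = failed.group(1)
            continue
        reason = REASON_RE.search(line)
        if reason and mount is not None:
            counts[(mount, reason.group(1))] += 1
            mount = None  # wait for the next "Failed to load" line
    return counts
```

Feeding it the lines of a pod log (for example, the output of `kubectl logs` saved to a file) would give one counter entry per mount and error reason, which makes it easier to tell "service files missing" apart from other reasons.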

Problem:
The failover does appear to take place: a new leader is elected and the service attempts to install the Foxx app that was previously working on the original leader. However, we then see 503 errors, and the Foxx app is sometimes viewable and sometimes not. The Foxx app then needs to be hard-deleted before a new, working installation can be done. Until this is done, the Foxx app remains in a partially broken state (sometimes performing its jobs as expected, otherwise returning 503 errors).

Expected result:
Foxx app reinstalls without errors on new leader following failover
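
To put a number on the intermittent 503 responses described above, a small probe can request the mount repeatedly and tally the status codes. This is a hedged sketch using only the Python standard library; the function names are hypothetical, the URL is whatever address the load balancer exposes, and it assumes the route can be reached without authentication.

```python
import time
import urllib.error
import urllib.request
from collections import Counter


def http_status(url, timeout=5.0):
    """Return the HTTP status of a GET request, including error statuses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx; the code is what we want to count.
        return err.code


def probe(url, attempts, delay=0.0, fetch=http_status):
    """Request url `attempts` times and count the status codes seen."""
    counts = Counter()
    for _ in range(attempts):
        counts[fetch(url)] += 1
        time.sleep(delay)
    return counts
```

A result that mixes 200 and 503 for the same mount would match the "sometimes viewable, sometimes not" behaviour; the `fetch` parameter exists only so the tally can be exercised without a server.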

@Simran-B
Contributor

Hi @rivet351,
3.7.9 has a fix related to Foxx self-heal. It doesn't seem to be related to your issue, but could you try the latest version anyway?

Are you able to reproduce the problem in a local setup or only on Azure? I wonder if it's Azure specific or not.

Does the initial leader come back up as follower and remain follower?

Can you compare the contents of the _appbundles system collections and the Foxx service files in the file system of both leader and follower? I wonder if there's a discrepancy at some point. Also, what is the file system? Maybe some files are actually not available temporarily for some reason?
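
One way to carry out the `_appbundles` comparison suggested above is sketched below in Python (standard library only). It rests on several assumptions: it reads document keys from `_appbundles` through the `/_api/cursor` HTTP endpoint in the `_system` database, uses basic auth, and sends `X-Arango-Allow-Dirty-Read: true` so that a follower answers the read instead of rejecting it. The function names and URLs are hypothetical, and the sketch does not interpret the keys; it only shows whether the two sets differ.

```python
import base64
import json
import urllib.request


def fetch_bundle_keys(base_url, user, password):
    """Return the set of _appbundles document keys on one server (assumed setup)."""
    body = json.dumps({"query": "FOR b IN _appbundles RETURN b._key"}).encode()
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        f"{base_url}/_api/cursor",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
            # Lets a follower serve the read-only query.
            "X-Arango-Allow-Dirty-Read": "true",
        },
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return set(json.load(resp)["result"])


def diff_bundles(leader_keys, follower_keys):
    """Report keys present on only one side."""
    return {
        "only_on_leader": sorted(leader_keys - follower_keys),
        "only_on_follower": sorted(follower_keys - leader_keys),
    }
```

Calling `diff_bundles(fetch_bundle_keys(leader_url, ...), fetch_bundle_keys(follower_url, ...))` with each pod's own address (not the load balancer) would return two empty lists if the collections agree. The service files on disk would still need to be compared separately, for example from a shell in each pod.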

@rivet351
Author

rivet351 commented Apr 29, 2021

Hi @Simran-B

We upgraded to 3.7.10 and have repeated the test multiple times, with failover completing successfully, so it seems the self-heal was the issue. Thanks for your help; I'm closing the ticket.

@rivet351
Author

rivet351 commented May 10, 2021

Hi,

I've re-opened the ticket as we have now had this happen to our leader without a failover event (similar errors as seen in the stack trace above).

The only fix we have is to re-deploy the same foxx app build. This immediately fixes the issue and service resumes as normal.

The two differences in the stack trace this time:

2021-05-10T08:44:38Z [1] ERROR [24213] via ArangoError: service files outdated
2021-05-10T08:44:41Z [1] ERROR [24213] at loadInstalledService (/usr/share/arangodb3/js/server/modules/@arangodb/foxx/manager.js:626:13)
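
For anyone who wants to script the workaround described above (re-deploying the same Foxx build), the request could be prepared roughly as follows. This is a sketch under assumptions, not the reporter's actual deployment method: it targets the Foxx HTTP API's replace operation (`PUT /_api/foxx/service` with a `mount` query parameter and a zip bundle as the body), and the function name, base URL, and auth handling are hypothetical.

```python
import urllib.parse
import urllib.request


def build_replace_request(base_url, mount, bundle, auth_header=None):
    """Build (but do not send) a request that replaces the service at `mount`.

    `bundle` is the raw bytes of the service zip that was deployed before.
    """
    query = urllib.parse.urlencode({"mount": mount})
    headers = {"Content-Type": "application/zip"}
    if auth_header:
        headers["Authorization"] = auth_header
    return urllib.request.Request(
        f"{base_url}/_api/foxx/service?{query}",
        data=bundle,
        method="PUT",
        headers=headers,
    )
```

Sending the request with `urllib.request.urlopen(req)` against the current leader would perform the re-deploy. Building the request separately keeps the sketch easy to inspect before anything is changed on a production server.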

@Simran-B Simran-B added 1 Analyzing and removed 2 Fixed Resolution labels May 10, 2021
@pluma
Contributor

pluma commented May 12, 2021

@rivet351 Can you confirm that this happened after an upgrade/replace of the Foxx app on the leader which then/later produced this log output?

@rivet351
Author

rivet351 commented May 12, 2021

Hi,

It looks to have happened randomly, several days after our last deployment. Timeline for this looks like (v.3.7.10):
2021-05-05 - deployed a new Foxx app version to the leader
2021-05-05 - 2021-05-09 - worked as normal
2021-05-10 - error above seen - no failover - required manually deleting and triggering a re-install to work again

@cubeover

cubeover commented May 13, 2021

We are seeing the issue too, on 3.7.6. It started after a failover.

@cubeover

cubeover commented May 13, 2021

[24213] at eval (eval at (unknown source), :3:21)
[24213] at eval (eval at (unknown source), :3:9)
[24213] at eval (eval at (unknown source), :2:50)
[24213] at Object.exports.manage (/volumes/data/ARANGO/resilientsingle8529/data/js/server/modules/@arangodb/foxx/queues/manager.js:222:19)
[24213] at Object.selfHealAll [as healAll] (/volumes/data/ARANGO/resilientsingle8529/data/js/server/modules/@arangodb/foxx/manager.js:192:20)
[24213] at selfHeal (/volumes/data/ARANGO/resilientsingle8529/data/js/server/modules/@arangodb/foxx/manager.js:231:5)
[24213] at initLocalServiceMap (/volumes/data/ARANGO/resilientsingle8529/data/js/server/modules/@arangodb/foxx/manager.js:439:23)
[24213] at loadInstalledService (/volumes/data/ARANGO/resilientsingle8529/data/js/server/modules/@arangodb/foxx/manager.js:519:13)
[24213] Mount: /foxx
[24213] via ArangoError: service files outdated
[24213] Failed to load Foxx service mounted at "/foxx"
[24213] at eval (eval at (unknown source), :3:21)
[24213] at eval (eval at (unknown source), :3:9)
[24213] at eval (eval at (unknown source), :2:50)
[24213] at Object.exports.manage (/volumes/data/ARANGO/resilientsingle8529/data/js/server/modules/@arangodb/foxx/queues/manager.js:222:19)
[24213] at Object.selfHealAll [as healAll] (/volumes/data/ARANGO/resilientsingle8529/data/js/server/modules/@arangodb/foxx/manager.js:192:20)
[24213] at selfHeal (/volumes/data/ARANGO/resilientsingle8529/data/js/server/modules/@arangodb/foxx/manager.js:231:5)
[24213] at initLocalServiceMap (/volumes/data/ARANGO/resilientsingle8529/data/js/server/modules/@arangodb/foxx/manager.js:439:23)
[24213] at loadInstalledService (/volumes/data/ARANGO/resilientsingle8529/data/js/server/modules/@arangodb/foxx/manager.js:519:13)
[24213] Mount: /foxx

@rivet351
Author

rivet351 commented Jun 4, 2021

Hi,

Any update on this?

@Simran-B
Contributor

Simran-B commented Jun 9, 2021

Unfortunately not. I created an internal ticket https://arangodb.atlassian.net/browse/BTS-484 for tracking the issue.

@dothebart
Contributor

Hi,
This has been fixed by this PR: #14754 which is available for download with the latest ArangoDB 3.7 releases.
I'm sorry we didn't comment on this here as well. The missing self-heal was also discovered independently by internal tests and was therefore fixed without communicating it here.

5 participants