CPU randomly spiking at 100% CPU #13312

Closed
minimind opened this issue Dec 30, 2020 · 11 comments

Comments

@minimind

minimind commented Dec 30, 2020

My Environment

  • ArangoDB Version: 3.6.3
  • Storage Engine: RocksDB
  • Deployment Mode: Single Server
  • Deployment Strategy: Manual Start in Docker
  • Configuration: 4 vCPUs, 32 GiB memory
  • Infrastructure: Azure Standard E4s v3
  • Operating System: Ubuntu 16.04
  • Total RAM in your machine: 32 GB
  • Disks in use: SSD
  • Used Package: Docker - official Docker library

Component, Query & Data

Affected feature: Server

Problem:

We are encountering 100% CPU spikes on our ArangoDB database every week or so, at random times. The CPU usage increases linearly over about 30 minutes until it reaches 100%. Restarting ArangoDB brings the CPU back to normal. There are no obvious causes. There are on the order of 100 million documents in our main collection. There are a few slow queries, but nothing out of the ordinary. Load on the DB is low and normal; it usually hovers around 5% or so. Following advice in previous issues, we ran 'top -H' and could see that the CPU usage is divided equally between the various SchedWorker threads. There are no errors in the logs, nor any out-of-the-ordinary messages at all. There doesn't appear to be a resource limitation for RAM.

We'd like to know whether this is expected or whether we should be concerned. Is this a known issue? What other information could we provide?
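
The per-thread view that 'top -H' gives can also be captured from a script, which makes it easier to attach numbers to a report like the one above. The following is a minimal sketch, assuming a Linux host with /proc mounted; the function names are hypothetical and nothing here is ArangoDB-specific. It reads the cumulative CPU ticks (utime + stime) of every thread of a process, so sampling it twice and subtracting shows which threads were busy in between.

```python
from pathlib import Path


def parse_task_stat(line):
    """Return (thread_name, cpu_ticks) from one /proc/<pid>/task/<tid>/stat line."""
    # The thread name sits in parentheses and may itself contain spaces or
    # parentheses, so cut at the first "(" and the last ")".
    name = line[line.index("(") + 1 : line.rindex(")")]
    rest = line[line.rindex(")") + 2 :].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    utime, stime = int(rest[11]), int(rest[12])
    return name, utime + stime


def thread_cpu_ticks(pid):
    """Map thread id -> (name, cumulative CPU ticks) for every thread of pid."""
    result = {}
    for task in Path(f"/proc/{pid}/task").iterdir():
        try:
            result[int(task.name)] = parse_task_stat((task / "stat").read_text())
        except (FileNotFoundError, ProcessLookupError):
            continue  # the thread exited while we were reading
    return result
```

Calling thread_cpu_ticks(pid) a few seconds apart and diffing the tick counts per thread id gives roughly the same picture as 'top -H', in a form that can be logged alongside other metrics.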

@mpoeter
Member

mpoeter commented Dec 31, 2020

Can you take a coredump the next time this happens and share it? That way we can investigate what's going on.

@woodytec

woodytec commented Jan 4, 2021

We are also encountering this kind of issue. The dashboard below shows CPU spikes above 100%, sometimes lasting tens of minutes.
[Image: CPU Arango dashboard]

We are using ArangoDB 3.6.2, with RocksDB as a storage engine and in single server mode. Operating system is Windows 10.

We've collected metrics with Prometheus, on both ArangoDB and Windows. We can make them available if needed.

@mpoeter
Member

mpoeter commented Jan 4, 2021

@woodytec can you create a minidump the next time this happens? On Windows the easiest way is via the Task Manager: on the "Details" page, right-click the arangod.exe process and select "Create dump file".

@minimind
Author

minimind commented Jan 4, 2021

@mpoeter Sharing a core dump might be problematic due to PII issues, but I will find out. We are using a Docker container with tag 3.6.3. What's the best way of installing the debugging symbols for this build into the container so we can look at the coredump stack traces and share those? Installing the Debian package for the symbols into the Docker container doesn't work off the bat.

Some more information: after about 30 mins of 100% CPU, the DB gets unresponsive and we get errors saying the scheduler queue is full.

@minimind
Author

minimind commented Jan 4, 2021

When running strace on the ArangoDB process, this is reported endlessly:

futex({hex number}, FUTEX_WAIT_PRIVATE, 2, {0, {number}}) = -1 ETIMEDOUT (Connection timed out)

Does this provide any clues? Have you seen this before?
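
To put a number on "reported endlessly", captured strace output can be tallied per system call and error. The following is a minimal sketch, assuming the output was saved to a text file with one call per line in the format quoted above; the function name and the sample values in the usage note are hypothetical.

```python
import re
from collections import Counter

# Matches "name(args) = result [ERRNO ...]" as printed by strace,
# with an optional "[pid N]" prefix.
LINE_RE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+)?(\w+)\(.*\)\s+=\s+(-?\d+|\?)(?:\s+(E[A-Z]+))?"
)


def summarize(lines):
    """Count strace lines by (syscall name, errno name or "ok")."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            syscall, _result, errno_name = match.groups()
            counts[(syscall, errno_name or "ok")] += 1
    return counts
```

For example, summarize(open("trace.txt")) on a hypothetical capture returns a Counter whose most_common() list shows whether timed-out futex waits really dominate or whether other calls are mixed in.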

@dothebart
Contributor

Hi,
Please note that the latest bugfix release is 3.6.10; please upgrade and check whether the issue persists.
Futexes are locks which are used to control access to resources that mustn't be used twice at the same time.
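
As an illustration of the kind of timed lock wait described above, here is a minimal Python sketch. It is not ArangoDB code, the function name is hypothetical, and how the wait is implemented underneath depends on the platform. A caller waits a bounded time for a lock that someone else holds and gives up when the time runs out, which is the situation a timed-out futex wait reports.

```python
import threading

lock = threading.Lock()


def try_enter(timeout_s):
    """Try to take the lock, giving up after timeout_s seconds."""
    got_it = lock.acquire(timeout=timeout_s)
    if got_it:
        # We only wanted to know whether the resource was free.
        lock.release()
    return got_it
```

While another holder keeps the lock, try_enter(0.01) returns False after the timeout; once the lock is released, it returns True.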

@minimind
Author

OK thanks. We are currently planning an upgrade to 3.7.6 next week. We'll see if the issue persists after, as you suggest.

@Simran-B
Contributor

@minimind Did you upgrade successfully? Does the problem persist?

@Simran-B
Contributor

@minimind Any update from your side?

@minimind
Author

minimind commented Mar 17, 2021

Yes. We upgraded to 3.7.6 and this completely fixed the problem. We haven't had a single occurrence.

@Simran-B
Contributor

Great, thanks for the feedback! I'm going to close the issue, but please re-open should the problem occur again.

Simran-B added 2 Fixed Resolution and removed Waiting User Reply labels Mar 17, 2021