CPU randomly spiking at 100% CPU #13312

minimind · 2020-12-30T14:35:09Z

My Environment

ArangoDB Version: 3.6.3
Storage Engine: RocksDB
Deployment Mode: Single Server
Deployment Strategy: Manual Start in Docker
Configuration: 4 vcpus 32 GiB memory
Infrastructure: Azure Standard E4s v3
Operating System: Ubuntu 16.04
Total RAM in your machine: 32 GB
Disks in use: SSD
Used Package: Docker - official Docker library

Component, Query & Data

Affected feature: Server

Problem:

We are encountering 100% CPU spikes on our ArangoDB database every week or so at random times. CPU usage increases linearly over about 30 minutes until it reaches 100%. Restarting ArangoDB returns CPU usage to normal. There is no obvious cause. Our main collection holds on the order of 100 million documents. There are a few slow queries, but nothing out of the ordinary. Load on the DB is low and typical for us; it usually hovers around 5%. Following advice in previous issues, we ran 'top -H' and could see that the CPU usage is divided equally between the various SchedWorker threads. There are no errors in the logs, or any out-of-the-ordinary messages at all. There doesn't appear to be a resource limitation for RAM.

We'd like to know whether this is expected or whether we should be concerned. Is this a known issue? What other information could we provide?
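For readers who want to repeat the 'top -H' check described above in a scriptable form, here is a minimal Python sketch. It is not an ArangoDB tool: it assumes a Linux host, reads the per-thread `stat` files under `/proc/<pid>/task`, and groups the CPU ticks consumed during a sampling interval by thread name. The function names (`parse_stat_line`, `snapshot`, `cpu_delta_by_name`, `sample`) are made up for this illustration.

```python
import os
import time
from collections import defaultdict


def parse_stat_line(line):
    """Parse one /proc/<pid>/task/<tid>/stat line into (thread_name, cpu_ticks)."""
    # The thread name sits in parentheses and may itself contain spaces or
    # parentheses, so split on the first "(" and the last ")".
    name = line[line.index("(") + 1:line.rindex(")")]
    fields = line[line.rindex(")") + 2:].split()
    # fields[0] is field 3 (state) in proc(5); utime and stime are fields
    # 14 and 15, which puts them at indexes 11 and 12 here.
    return name, int(fields[11]) + int(fields[12])


def snapshot(pid):
    """Return {tid: (thread_name, cpu_ticks)} for every thread of `pid`."""
    snap = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as handle:
                snap[tid] = parse_stat_line(handle.read())
        except OSError:
            continue  # the thread exited between listdir() and open()
    return snap


def cpu_delta_by_name(before, after):
    """Sum the CPU ticks used between two snapshots, grouped by thread name."""
    totals = defaultdict(int)
    for tid, (name, ticks) in after.items():
        if tid in before:  # ignore threads created during the interval
            totals[name] += ticks - before[tid][1]
    return dict(totals)


def sample(pid, interval=5.0):
    """Take two snapshots `interval` seconds apart and return the per-name delta."""
    before = snapshot(pid)
    time.sleep(interval)
    return cpu_delta_by_name(before, snapshot(pid))
```

Calling `sample(pid)` with the process ID of the server (as seen from wherever the script runs, for example inside the container) returns a dictionary of thread name to ticks. If nearly all ticks land on the SchedWorker entry, that matches the even split seen in 'top -H'.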

mpoeter · 2020-12-31T18:41:58Z

Can you take a coredump the next time this happens and share it? That way we can investigate what's going on.

woodytec · 2021-01-04T10:03:46Z

We are also encountering this kind of issue. The dashboard below shows CPU spikes above 100%, sometimes lasting tens of minutes.

We are using ArangoDB 3.6.2, with RocksDB as a storage engine and in single server mode. Operating system is Windows 10.

We've collected metrics with Prometheus, on both ArangoDB and Windows. We can make them available if needed.

mpoeter · 2021-01-04T10:16:18Z

@woodytec can you create a minidump the next time this happens? On Windows the easiest way is via the Task Manager: on the "Details" tab, right-click the arangod.exe process and select "Create dump file".

minimind · 2021-01-04T11:18:07Z

@mpoeter Sharing a core dump might be problematic due to PII issues, but I will find out. We are using a Docker container with tag 3.6.3. What's the best way of installing the debugging symbols for this build into the container so we can look at the core dump stack traces and share those? Installing the Debian package for the symbols into the Docker container doesn't work off the bat.

Some more information: after about 30 minutes at 100% CPU, the DB becomes unresponsive and we get errors saying the scheduler queue is full.

minimind · 2021-01-04T16:36:12Z

When running strace on the ArangoDB process, this is reported endlessly:

futex({hex number}, FUTEX_WAIT_PRIVATE, 2, {0, {number}}) = -1 ETIMEDOUT (Connection timed out)

Does this provide any clues? Have you seen this before?
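To put a number on "reported endlessly", a captured strace log can be tallied by system call and outcome. The following is a minimal Python sketch, not something from the ArangoDB tooling. It assumes plain strace output lines of the shape shown above, optionally with the `[pid N]` prefix that `strace -f` adds, and it does not handle timestamp prefixes. The name `summarize` is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   futex(ADDR, FUTEX_WAIT_PRIVATE, 2, {...}) = -1 ETIMEDOUT (Connection timed out)
LINE_RE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+)?"   # optional "[pid N] " prefix added by strace -f
    r"(\w+)\(.*\)"              # syscall name and its argument list
    r"\s+=\s+(-?\d+)"           # return value
    r"(?:\s+(E[A-Z0-9]+))?"     # errno name, present only when the call failed
)


def summarize(lines):
    """Count strace lines per (syscall, outcome); outcome is an errno name or "ok"."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # signal notices, exit markers, unfinished calls, etc.
        syscall, _return_value, errno_name = match.groups()
        counts[(syscall, errno_name or "ok")] += 1
    return counts
```

Passing an open log file, for example `summarize(open("strace.log"))`, and then calling `.most_common(5)` on the result shows which calls dominate. A large count for `("futex", "ETIMEDOUT")` only says that threads repeatedly waited on a lock until their timeout expired; it does not identify which lock or why.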

dothebart · 2021-01-11T09:30:27Z

Hi,
Please note that the latest bugfix release is 3.6.10; please upgrade and check whether the issue persists.
Futexes are locks which are used to control access to resources that mustn't be used twice at the same time.

minimind · 2021-01-11T11:43:54Z

OK thanks. We are currently planning an upgrade to 3.7.6 next week. We'll see if the issue persists after, as you suggest.

Simran-B · 2021-01-27T15:49:12Z

@minimind Did you upgrade successfully? Does the problem persist?

Simran-B · 2021-03-17T10:38:20Z

@minimind Any update from your side?

minimind · 2021-03-17T10:59:59Z

Yes. We upgraded to 3.7.6 and this completely fixed the problem. We haven't had a single occurrence.

Simran-B · 2021-03-17T11:45:37Z

Great, thanks for the feedback! I'm going to close the issue, but please re-open should the problem occur again.

dothebart added 4 Linux 4 Windows and removed 4 Linux labels Jan 11, 2021

Simran-B added 1 Analyzing 4 Linux Waiting User Reply 4 Windows and removed 4 Windows labels Jan 13, 2021

Simran-B closed this as completed Mar 17, 2021

Simran-B added 2 Fixed Resolution and removed Waiting User Reply labels Mar 17, 2021
