Reserve CPU for system operations #14068

SStorm · 2023-05-05T12:57:27Z

Problem Statement

With the right set of heavy queries, it is possible to exhaust all available CPU on a CrateDB Cloud cluster. At that point the cluster becomes unresponsive and difficult to debug (i.e. it is not possible to query sys.jobs, sys.jobs_log, or other system tables).

It would be super useful if CrateDB had some form of QoS for threadpools, and always reserved a fraction of a CPU for system management operations.

An inspiration for this is the disk high watermark on Linux, where the last 5% (configurable) is reserved for root.

Possible Solutions

If QoS is not possible, can we somehow tweak the thread pool sizes when starting a cluster? This assumes there is a separate thread pool for management/system operations.
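To make the sizing idea concrete, here is a minimal sketch of the arithmetic, assuming a hypothetical two-pool setup (one general pool, one pool reserved for system operations). The function name `split_threads` and the 5% default are illustrative assumptions borrowed from the disk analogy above; they are not CrateDB settings.

```python
import math
import os
from typing import Tuple


def split_threads(total_cpus: int, reserved_fraction: float = 0.05) -> Tuple[int, int]:
    """Return (general_threads, system_threads) for a hypothetical two-pool setup."""
    if total_cpus < 1:
        raise ValueError("total_cpus must be at least 1")
    if not 0.0 < reserved_fraction < 1.0:
        raise ValueError("reserved_fraction must be strictly between 0 and 1")
    # Always keep at least one thread for system work, even on small machines.
    system = max(1, math.ceil(total_cpus * reserved_fraction))
    # The general pool gets the remainder, but never drops below one thread.
    general = max(1, total_cpus - system)
    return general, system


if __name__ == "__main__":
    cpus = os.cpu_count() or 1
    print(split_threads(cpus))
```

Note that sizing pools this way only caps how many threads compete for CPU; it does not by itself guarantee CPU time to the reserved pool.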

Considered Alternatives

No response


mfussenegger · 2023-05-08T07:55:57Z

I suspect the problem here isn't that threads aren't given enough CPU time, but rather that there is too much GC load. You should be seeing GC warnings in the logs.

The kernel scheduler should already ensure that each thread receives its share of CPU time. The thread pools add some kind of QoS as you describe it. We use them to deal with blocking IO and to run certain kinds of queries on different thread pools. E.g. system queries that don't hit IO run directly in the netty thread pool. Regular SELECTs on user tables run in the search thread pool, and sys.shards queries use the get thread pool.
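As a rough illustration of that routing, the following toy model maps query kinds to pools. It is not CrateDB code: the query-kind strings, the pool sizes, and the helpers `pool_for` and `dispatch` are assumptions made up for this sketch. Only the pool names (netty, search, get) come from the description above.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor

# Pool names follow the description above; the sizes are arbitrary assumptions.
POOLS = {
    "search": ThreadPoolExecutor(max_workers=4, thread_name_prefix="search"),
    "get": ThreadPoolExecutor(max_workers=2, thread_name_prefix="get"),
}


def pool_for(query_kind: str) -> str:
    """Map a hypothetical query kind to the pool that would run it."""
    if query_kind == "system_no_io":
        return "netty"   # runs inline on the network thread, no hand-off
    if query_kind == "sys_shards":
        return "get"
    if query_kind == "user_select":
        return "search"
    raise ValueError(f"unknown query kind: {query_kind}")


def dispatch(query_kind: str, fn, *args) -> Future:
    """Run fn on the pool chosen for query_kind and return a Future."""
    name = pool_for(query_kind)
    if name == "netty":
        # Model "runs directly in the netty thread pool" as inline execution
        # on the calling thread, which stands in for a network worker here.
        future: Future = Future()
        try:
            future.set_result(fn(*args))
        except Exception as exc:
            future.set_exception(exc)
        return future
    return POOLS[name].submit(fn, *args)


def current_thread_name() -> str:
    return threading.current_thread().name


if __name__ == "__main__":
    print(dispatch("user_select", current_thread_name).result())
    print(dispatch("sys_shards", current_thread_name).result())
    print(dispatch("system_no_io", current_thread_name).result())
```

The inline branch also shows why a slow task on the network thread is a concern: while it runs, that thread cannot pick up anything else.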

We already track various other improvements that would reduce GC load, or would help prevent GC load escalating, so I'm closing this.

Some of the more general ones:

mfussenegger · 2023-06-27T12:46:57Z

Re-opening this. Maybe we have cases where we overload the netty workers, which could lead to new requests (e.g. follower checks and pings) no longer being processed.

But this needs some more investigation.

There could also be options to tweak query scheduling (e.g. some cooperative approach where RowConsumers yield after they have run for some time, or some other work-stealing approach).
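A minimal sketch of the cooperative idea, using Python generators as stand-ins for consumers that yield control. This is an illustration only: `consume_rows`, `run_round_robin`, and the task names are hypothetical, and the sketch yields after a fixed row budget rather than after elapsed time so that the result is deterministic.

```python
from collections import deque
from typing import Deque, Generator, Iterable, List

Task = Generator[None, None, str]


def consume_rows(name: str, total_rows: int, budget: int) -> Task:
    """Process rows, yielding control after every `budget` rows."""
    processed = 0
    while processed < total_rows:
        step = min(budget, total_rows - processed)
        processed += step          # stand-in for real per-row work
        if processed < total_rows:
            yield                  # cooperative yield point
    return name


def run_round_robin(tasks: Iterable[Task]) -> List[str]:
    """Run generator tasks on a single worker; return names in completion order."""
    queue: Deque[Task] = deque(tasks)
    finished: List[str] = []
    while queue:
        task = queue.popleft()
        try:
            next(task)             # run until the next yield point
            queue.append(task)     # not done yet: go to the back of the queue
        except StopIteration as done:
            finished.append(done.value)
    return finished


if __name__ == "__main__":
    order = run_round_robin([
        consume_rows("heavy_select", total_rows=1_000_000, budget=10_000),
        consume_rows("sys_jobs_query", total_rows=50, budget=10_000),
    ])
    print(order)
```

In this model the small task finishes after the heavy task's first slice instead of waiting for all of it, which is the property a yielding consumer would be meant to provide on a shared worker.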

matriv added feature complexity: no estimate labels May 5, 2023

mfussenegger closed this as completed May 8, 2023

mfussenegger reopened this Jun 27, 2023

hlcianfagna added the feature: administration label Aug 3, 2023

henrikingo added this to CrateDB 5.10 Sep 20, 2024

henrikingo moved this to Candidates in CrateDB 5.10 Sep 20, 2024

mfussenegger added the need refined description label Sep 24, 2024

mfussenegger mentioned this issue Oct 28, 2024

Log long running queries before they complete #16826

Closed
