Reserve CPU for system operations #14068

Open
SStorm opened this issue May 5, 2023 · 2 comments
Labels
complexity: no estimate feature: administration feature need refined description

Comments

@SStorm
SStorm commented May 5, 2023

Problem Statement

With the right set of heavy queries, it is possible to exhaust all available CPU on a CrateDB Cloud cluster. At that point the cluster becomes unresponsive and difficult to debug (i.e. it is no longer possible to query sys.jobs, sys.jobs_log, and other system tables).

It would be super useful if CrateDB had some form of QoS for threadpools, and always reserved a fraction of a CPU for system management operations.

An inspiration for this is the reserved-block mechanism of Linux ext filesystems, where the last 5% of disk space (configurable) is reserved for root.

Possible Solutions

If QoS is not possible, can we somehow tweak the thread pool sizes when starting a cluster? This assumes there is a separate thread pool for management/system operations.
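To make the idea concrete, here is a minimal Python sketch of keeping a small, separate pool for system operations next to the pool for user queries. It is not CrateDB code: `pool_sizes`, the pool names, and the one-worker reservation are assumptions for illustration. It only demonstrates queue isolation between pools; it does not reserve CPU time at the kernel level.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor


def pool_sizes(cpu_count, reserved=1):
    """Split workers into (user, system) pools, keeping at least one worker in each.

    Hypothetical helper: `reserved` is the number of workers set aside for
    system/management operations.
    """
    system = max(1, reserved)
    user = max(1, cpu_count - system)
    return user, system


user_size, system_size = pool_sizes(os.cpu_count() or 2)
user_pool = ThreadPoolExecutor(max_workers=user_size, thread_name_prefix="user")
system_pool = ThreadPoolExecutor(max_workers=system_size, thread_name_prefix="system")

# Stand-in for heavy queries: every task blocks until `release` is set,
# so the user pool is fully occupied and further tasks pile up in its queue.
release = threading.Event()
heavy = [user_pool.submit(release.wait) for _ in range(user_size * 2)]

# A system operation goes to its own pool and does not wait behind the user queue.
system_result = system_pool.submit(lambda: "system query answered").result(timeout=5)

# Let the blocked tasks finish and clean up.
release.set()
for future in heavy:
    future.result(timeout=5)
user_pool.shutdown()
system_pool.shutdown()
```

In this sketch the system task completes while every user worker is still blocked, which is the behaviour asked for above. Whether that is enough in practice depends on what actually starves the system queries.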

Considered Alternatives

No response

@mfussenegger
Member

I suspect the problem here isn't that threads aren't given enough CPU time, but rather that there is too much GC load. You should be seeing GC warnings in the logs.

The kernel scheduler should already ensure that each thread receives its share of CPU time. The thread pools add some kind of QoS as you describe it. We use them to deal with blocking IO and to run certain kinds of queries on different thread pools. E.g. system queries that don't hit IO run directly in the netty thread pool. Regular SELECTs on user tables run in the search thread pool, and sys.shards queries use the get thread pool.
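The routing described above can be restated as a small lookup. This is a hypothetical Python sketch, not CrateDB's dispatch logic: the function name, its parameters, and the `"unspecified"` fallback are assumptions, and it covers only the three cases named in this comment.

```python
def thread_pool_for(table: str, hits_io: bool) -> str:
    """Return the thread pool name for a query, per the three cases in the comment.

    Hypothetical helper for illustration only.
    """
    if table == "sys.shards":
        return "get"
    if table.startswith("sys."):
        # Only system queries that avoid IO are said to run on the netty pool;
        # the comment does not cover other system queries.
        return "netty" if not hits_io else "unspecified"
    # Regular SELECTs on user tables.
    return "search"


# "sys.example" and "doc.metrics" are made-up table names.
routes = {
    "sys.shards": thread_pool_for("sys.shards", hits_io=True),
    "sys.example": thread_pool_for("sys.example", hits_io=False),
    "doc.metrics": thread_pool_for("doc.metrics", hits_io=True),
}
```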

We already track various other improvements that would reduce GC load, or would help prevent GC load escalating, so I'm closing this.

Some of the more general ones:

@mfussenegger
Member

mfussenegger commented Jun 27, 2023

Re-opening this. Maybe we have cases where we overload the netty workers, which could lead to new requests (e.g. follower checks and pings) no longer being processed.

But this needs some more investigation.

There could also be options to tweak query scheduling (e.g. a cooperative approach where RowConsumers yield after they have run for some time, or some work-stealing approach).
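A rough sketch of the cooperative idea, written in Python with generators rather than against CrateDB's RowConsumer interface. All names here (`consume_rows`, `ping`, `run_round_robin`) are hypothetical, and a fixed row budget stands in for "ran for some time" so that the example stays deterministic.

```python
from collections import deque


def consume_rows(name, rows, budget, log):
    """Process rows, handing control back to the scheduler every `budget` rows."""
    processed = 0
    for _row in rows:
        processed += 1
        if processed % budget == 0:
            log.append(f"{name} yields after {processed} rows")
            yield
    log.append(f"{name} done ({processed} rows)")


def ping(log):
    """A short task, standing in for a follower check or ping."""
    log.append("ping answered")
    yield


def run_round_robin(tasks):
    """Run generator-based tasks on one worker, rotating whenever a task yields."""
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)
        except StopIteration:
            continue  # task finished, drop it
        queue.append(task)


log = []
run_round_robin([consume_rows("heavy query", range(10), 4, log), ping(log)])
```

In this sketch the ping is answered after the heavy query's first yield instead of after it finishes. A real implementation would also have to decide where a consumer can safely pause and how its state is kept across the pause.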

@mfussenegger mfussenegger reopened this Jun 27, 2023
@henrikingo henrikingo moved this to Candidates in CrateDB 5.10 Sep 20, 2024
@mfussenegger mfussenegger added the need refined description label Sep 24, 2024
Projects
Status: Candidates
Development

No branches or pull requests

4 participants