Work around statistics(run_queue) returning incorrect data #3160
Comments
I've got a proof-of-concept implementation separating out the scheduler and dirty scheduler run queues in [1]. That PR uses one of the relatively new statistics calls. Btw, the PR is against 3.x; we'll want a separate PR for 4.x, which is why I'm adding the details here in the issue.

[1] #3161
I got confirmation in the Erlang bug report [1] that there is in fact a discrepancy between the documentation of `statistics(run_queue)` and its actual behavior.

That said, I don't think that's appropriate for our use case. I've replied to the ticket demonstrating a scenario where aggregating the normal and dirty CPU queues can obscure the meaning of `run_queue`. Therefore I remain of the opinion that it's important to separate the regular run_queue values from the dirty CPU run_queue. In the PR I linked above, I've already started going down that path.
I've merged both PRs, and now the CouchDB `run_queue` statistic represents the same values as before, along with a new statistic for the dirty CPU queue. Closing this out.
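For reference, a minimal sketch of how the two values could be computed separately. This is a hypothetical illustration, not the actual CouchDB implementation; it assumes OTP 20 or later, where `erlang:statistics(run_queue_lengths_all)` is available, and the module and key names are made up:

```erlang
%% Hypothetical sketch (not the merged CouchDB code): report the normal
%% scheduler run queue total separately from the dirty queues.
-module(run_queue_sketch).
-export([queue_stats/0]).

queue_stats() ->
    %% One entry per normal scheduler run queue.
    Normal = erlang:statistics(run_queue_lengths),
    %% The same list with the dirty run queue lengths appended.
    All = erlang:statistics(run_queue_lengths_all),
    RunQueue = lists:sum(Normal),
    Dirty = lists:sum(All) - RunQueue,
    [{run_queue, RunQueue}, {run_queue_dirty, Dirty}].
```

The point of the design is that each value keeps a single meaning: `run_queue` again reflects only work queued for the normal schedulers, while the dirty queue depth is exposed as its own statistic.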
Description

In the `/_system` endpoint, particularly in the `chttpd_node:get_stats/0` function, one of the data points returned is an integer reflecting the total sum of the lengths of all run queues [1]. With the introduction of dirty schedulers in Erlang, the `statistics(run_queue)` statistic now includes the depth of the dirty CPU and IO queues; however, the documentation indicates it should not include those values. At least that's how it appears to me; I've filed a bug on the matter [2] and will keep an eye on it.

In the meantime, we can isolate the scheduler run queues from the dirty scheduler run queues and report those values separately, which will ensure we have an accurate value for `run_queue` and also allow us to introduce a new value for the dirty CPU queue.

For more context, CouchDB uses several NIFs, but none of them are dirty NIFs. However, OTP's `rsa_generate_key` function is a dirty NIF [3] and will flow through the dirty CPU queue. If you're using something like HAProxy for SSL termination, you won't normally encounter this; however, we can still end up making dirty NIF calls for a push replication to a remote instance over https. The end result is that the `run_queue` statistic in the `/_system` endpoint will reflect the amount of crypto work going through the dirty CPU queue and will give false positives on run_queue overload, as elevated run queues are usually indicative of the Erlang VM having more work to handle than resources available.

Steps to Reproduce
See the bug report in [2] for more details.
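As a rough illustration of the effect, pushing RSA key generation through the dirty CPU schedulers should inflate `statistics(run_queue)` while the normal-scheduler totals stay low. This is a hedged sketch for an Erlang shell on OTP 20+, with an arbitrary number of processes and key size:

```erlang
%% Spawn processes doing RSA key generation, which runs as a dirty CPU
%% NIF, then compare the aggregate run_queue value against the total
%% for the normal scheduler run queues only.
[spawn(fun() -> public_key:generate_key({rsa, 2048, 65537}) end)
 || _ <- lists:seq(1, 100)],
erlang:statistics(run_queue),               %% includes dirty CPU queue depth
erlang:statistics(total_run_queue_lengths). %% normal scheduler queues only
```

If the first value is much larger than the second while the node is otherwise idle, the difference is work sitting in the dirty queues, not normal scheduler overload.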
Expected Behaviour
Your Environment
Additional Context
[1] https://github.com/apache/couchdb/blob/master/src/chttpd/src/chttpd_node.erl#L216
[2] https://bugs.erlang.org/browse/ERL-1355
[3] https://github.com/erlang/otp/blob/db6059a9217767a6e42e93cec05089c0ec977d20/lib/crypto/c_src/crypto.c#L3016-L3018