
Work around statistics(run_queue) returning incorrect data #3160

Closed
chewbranca opened this issue Sep 17, 2020 · 4 comments


chewbranca commented Sep 17, 2020

Description

In the /_system endpoint, particularly in the chttpd_node:get_stats/0 function, one of the data points returned is an integer reflecting the total sum of the lengths of all run queues [1]. With the introduction of dirty schedulers in Erlang, the statistics(run_queue) statistic now includes the depth of the dirty CPU and IO queues; however, the documentation indicates it should not include those values. At least that's how it appears to me; I've filed a bug on the matter [2] and will keep an eye on it.

In the meantime, we can isolate the scheduler run queues from the dirty scheduler run queues and report those values separately, which will ensure we have an accurate value for run_queue and also allow us to introduce a new value for the dirty CPU queue.

For more context, CouchDB uses several NIFs, but none of them are dirty NIFs. However, OTP's rsa_generate_key function is a dirty NIF [3] and will flow through the dirty CPU NIF queue. If you're using something like HAProxy for SSL termination, you won't normally encounter this; however, we can still end up making dirty NIF calls for a push replication to a remote instance over https. The end result is that the run_queue statistic in the /_system endpoint will reflect the amount of crypto work going through the dirty CPU queue and will give false positives on run_queue overload, as elevated run_queues are usually indicative of the Erlang VM having more work to handle than resources available.
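To make the failure mode concrete, here's a hypothetical shell session (assuming an OTP release with dirty schedulers and a crypto build that supports RSA key generation): flooding the dirty CPU schedulers with key generation inflates statistics(run_queue) even while the normal schedulers sit idle.

```erlang
%% Sketch only: saturate the dirty CPU run queue with RSA key
%% generation, which crypto implements as a dirty CPU NIF.
[spawn(fun() -> crypto:generate_key(rsa, {2048, 65537}) end)
 || _ <- lists:seq(1, 200)].

%% With the behaviour described above, this total now includes the
%% dirty CPU backlog, not just the normal scheduler run queues, so it
%% looks like scheduler overload even though the node is otherwise idle.
statistics(run_queue).
```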

Steps to Reproduce

See the bug report in [2] for more details.

Expected Behaviour

Your Environment

  • CouchDB version used:
  • Browser name and version:
  • Operating system and version:

Additional Context

[1] https://github.com/apache/couchdb/blob/master/src/chttpd/src/chttpd_node.erl#L216
[2] https://bugs.erlang.org/browse/ERL-1355
[3] https://github.com/erlang/otp/blob/db6059a9217767a6e42e93cec05089c0ec977d20/lib/crypto/c_src/crypto.c#L3016-L3018

@chewbranca

I've got a proof-of-concept implementation separating out the scheduler and dirty scheduler run_queues in [1].

That PR uses the relatively new statistics(run_queue_lengths) call [2], which was introduced in OTP 18.3. This statistic returns a list of length N + 1, where N is the number of regular schedulers (one run queue per scheduler) and the final +1 entry is the dirty CPU run queue (currently there is a single run queue shared by all dirty CPU schedulers in Erlang). You can see in the PR [1] that I'm using that data to also construct a new stat entry for the dirty CPU queue. We could use the even newer statistics(run_queue_lengths_all) statistic [3], which returns a list of N + 2 for the N schedulers plus the two dirty scheduler queues. However, Erlang does not yet use dirty IO NIFs for file IO, so that value will always be zero, and that function was introduced in OTP 20, while we still kinda sorta maybe tolerate OTP 19. I'm not opposed to requiring OTP 20, but there's no big rush on this front as we don't use dirty IO schedulers yet.
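For reference, here's a minimal sketch of that split (illustrative only, not the PR's exact code; the function name run_queues/0 is made up here):

```erlang
%% statistics(run_queue_lengths) yields one entry per normal scheduler
%% run queue followed by one entry for the dirty CPU run queue, so we
%% can split on erlang:system_info(schedulers) to separate the two.
run_queues() ->
    Lengths = statistics(run_queue_lengths),
    N = erlang:system_info(schedulers),
    {Normal, [DirtyCPU]} = lists:split(N, Lengths),
    {lists:sum(Normal), DirtyCPU}.
```

The summed Normal value keeps the old meaning of run_queue, and DirtyCPU feeds the new dirty CPU stat entry.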

Btw, the PR is against 3.x; we'll want a separate PR for 4.x, which is why I'm adding the details here in the issue.

[1] #3161
[2] https://erlang.org/doc/man/erlang.html#statistics_run_queue_lengths
[3] https://erlang.org/doc/man/erlang.html#statistics_run_queue_lengths_all

@chewbranca

I got confirmation in the Erlang bug report [1] that there is in fact a discrepancy between the documentation of statistics(run_queue) and its output; however, Rickard Green is of the opinion that all the CPU run_queues should be aggregated together, which certainly has merits. This morning he merged a change to the maintenance branch updating the documentation to indicate that statistics(run_queue) does not exclude the dirty CPU queue.

That said, I don't think that's appropriate for our use case. I've replied to the ticket demonstrating a scenario where aggregating the normal and dirty CPU queues can obscure the meaning of statistics(run_queue), especially for CouchDB, where the scenarios behind elevated normal run_queues are fundamentally different from those behind an elevated dirty CPU queue (assuming you use a proxy for SSL termination).

Therefore I remain of the opinion that it's important to separate the regular run_queue values from the dirty CPU run_queue. In the PR linked above, I've already started down that path using the statistics(run_queue_lengths) value, which is also much more efficient than what we've been doing: statistics(run_queue) takes a lock on all run_queues to read from them, whereas the newer APIs do not lock. So we can switch to the newer, lock-free call and isolate these two data points as separate metrics.
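As a quick way to see the discrepancy for yourself (a hypothetical check; the two calls sample a moving target, so the numbers only line up approximately):

```erlang
%% Under the current behaviour, statistics(run_queue) should roughly
%% equal the sum of statistics(run_queue_lengths), dirty CPU queue
%% included, even though the pre-fix documentation said the dirty
%% queues were excluded from the total.
Total = statistics(run_queue),
Lengths = statistics(run_queue_lengths),
{Total, lists:sum(Lengths)}.
```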

[1] https://bugs.erlang.org/browse/ERL-1355

@chewbranca

Alright, I've got two PRs out for the 3.x line and the main line:

  • main: #3168
  • 3.x: #3161

@chewbranca

I've merged both PRs, and now the CouchDB run_queue statistic represents the same values as before, along with a new statistic for the dirty CPU queue. Closing this out.
