Return real pend value in erlang:dist_get_stat/1 #2270

Merged: 2 commits merged into erlang:maint on Jun 18, 2019

@essen essen commented Jun 3, 2019

Only the dist_util code is using this function and it already
is compatible with a non-boolean value.

We are interested in using this value as a metric to know
how large the distribution output queue is when we encounter
distribution-related issues.

/cc @gerhard

@essen essen changed the base branch from master to maint Jun 3, 2019

@gerhard gerhard commented Jun 3, 2019

This would be a great addition. We have already found the perfect space for this metric, right next to State of distribution links:

image

@AndrewDryga AndrewDryga commented Jun 4, 2019

@gerhard is this your own tailor-made dashboard or a shared one?

@gerhard gerhard commented Jun 4, 2019

It's a dashboard that we (the RabbitMQ team) plan on sharing, most likely via grafana.com. It is not RabbitMQ-specific; it will work with any Erlang cluster that runs prometheus.erl. The Erlang part was done in deadtrickster/prometheus.erl#92.

This dashboard fits under a wider Observability initiative within RabbitMQ. All code currently lives under https://github.com/rabbitmq/rabbitmq-prometheus/tree/master/docker. To get it all up and running, check the Makefile in the parent directory; the up target spins everything up locally.

cc @michaelklishin

@essen essen commented Jun 11, 2019

It appears that the value is not the number of messages, as previously thought, but the number of bytes waiting to be sent. Indeed, it corresponds to the +zdbbl limit, which is specified in kilobytes.
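To make the unit relationship concrete, here is a minimal sketch of the arithmetic, not code from the patch. It assumes the pend value is reported in bytes (as stated above) and that +zdbbl is given in kilobytes of 1024 bytes; the function names are hypothetical, and 1024 KB is used only because it is the documented default for +zdbbl.

```python
# Illustrative helpers (hypothetical names, not part of OTP or this patch).

def zdbbl_to_bytes(zdbbl_kb: int) -> int:
    """Convert a +zdbbl setting (kilobytes) to bytes."""
    return zdbbl_kb * 1024


def headroom_bytes(pend_bytes: int, zdbbl_kb: int) -> int:
    """Bytes left before the pending data reaches the +zdbbl limit.

    Clamped at zero: the queue may already be at or past the limit.
    """
    return max(zdbbl_to_bytes(zdbbl_kb) - pend_bytes, 0)


if __name__ == "__main__":
    limit = zdbbl_to_bytes(1024)         # default +zdbbl of 1024 KB
    left = headroom_bytes(262144, 1024)  # 256 KiB currently pending
    print(limit, left)
```

Comparing the pend value against the +zdbbl number directly, without the factor of 1024, would overstate how full the buffer is.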

@gerhard gerhard commented Jun 11, 2019

We have deployed a 3-node Erlang cluster with this patch applied and wired everything together. This is what the end result looks like (notice the Data buffered in the distribution links queue panel):

image

These are all the relevant beam.smp flags that we are using:

/erlang-22.0.2.2207/lib/erlang/erts-10.4.1/bin/beam.smp
  -W w
  -A 256 
  -MBas ageffcbf
  -MHas ageffcbf
  -MBlmbcs 512
  -MHlmbcs 512
  -MMmcs 30
  -P 1048576
  -t 5000000
  -stbt db
  -zdbbl 128000
  -K true

We are running on Linux 4.15.0-50-generic 16.04.1-Ubuntu SMP x86_64

Is there anything else that you need from us before merging & cutting a new OTP release with this in?

gerhard added a commit to rabbitmq/rabbitmq-prometheus that referenced this issue Jun 11, 2019
gerhard added a commit to rabbitmq/rabbitmq-server-boshrelease that referenced this issue Jun 17, 2019
This was done so that we can validate erlang/otp#2270
@garazdawi garazdawi merged commit a9373f7 into erlang:maint Jun 18, 2019
2 checks passed

@gerhard gerhard commented Jun 18, 2019

:shipit:

@garazdawi garazdawi commented Jun 18, 2019

We will leave this function undocumented (and thus subject to change without prior notice). This is because changes that we may make in the future could make it impossible (or very expensive) to answer the question of how much data is in the queue.

@michaelklishin michaelklishin commented Jun 18, 2019

@garazdawi I think that observability should be an important design aspect. It would be very useful to have this metric or something that can be a reasonably close substitute.

@gerhard gerhard commented Jun 18, 2019

As long as we can capture the busy dist limit buffer - how much of zdbbl is being used - we honestly don't care how it's done under the hood.

The end-goal is to quantify how busy a particular distribution link is, and to know when this is becoming a bottleneck in an Erlang cluster.

Can you suggest a better/different way of going about it?

This is what this patch gives us, which is pretty good:

image
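As a rough illustration of "how much of zdbbl is being used", the ratio could be computed on the monitoring side along these lines. This is a sketch under assumptions, not how the dashboard or prometheus.erl does it: the per-node pend values are taken as already collected in bytes, the limit is the -zdbbl 128000 setting from the flags listed earlier, and the node names, function names and 0.8 threshold are all hypothetical.

```python
# Hypothetical sketch: fraction of the distribution buffer busy limit in use.

ZDBBL_KB = 128000  # from the -zdbbl flag used in this thread (kilobytes)


def link_utilisation(pend_bytes: int, zdbbl_kb: int = ZDBBL_KB) -> float:
    """Fraction of the busy limit currently occupied by pending data."""
    return pend_bytes / (zdbbl_kb * 1024)


def busy_links(samples: dict, threshold: float = 0.8) -> list:
    """Return node names whose link utilisation is at or above threshold.

    `samples` maps a node name to its pending byte count.
    """
    return sorted(
        node for node, pend in samples.items()
        if link_utilisation(pend) >= threshold
    )


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    samples = {"rabbit@node-a": 0, "rabbit@node-b": 120_000_000}
    for node, pend in samples.items():
        print(node, round(link_utilisation(pend), 3))
    print(busy_links(samples))
```

A ratio rather than a raw byte count keeps the panel comparable across clusters that run with different +zdbbl settings.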

gerhard pushed a commit to deadtrickster/prometheus.erl that referenced this issue Apr 24, 2020
Add node_queue_size_bytes metric to dist collector.

erlang/otp#2270 has been available since Erlang/OTP 22.1, released on 17 September 2019. Time to ship this feature 🚢

Thanks @essen!