Return real pend value in erlang:dist_get_stat/1 #2270

Merged
merged 2 commits into from Jun 18, 2019

essen commented Jun 3, 2019

Only the dist_util code is using this function and it already
is compatible with a non-boolean value.

We are interested in using this value as a metric to know
how large the distribution output queue is when we encounter
distribution-related issues.

/cc @gerhard

@essen essen changed the base branch from master to maint Jun 3, 2019

gerhard commented Jun 3, 2019

This would be a great addition. We have already found the perfect space for this metric, right next to State of distribution links:

image

AndrewDryga commented Jun 4, 2019

@gerhard is this your own tailor-made dashboard or a shared one?

gerhard commented Jun 4, 2019

It's a dashboard that we, the RabbitMQ team, plan to share, most likely via grafana.com. It is not RabbitMQ-specific: it will work with any Erlang cluster that runs prometheus.erl. The Erlang part was done in deadtrickster/prometheus.erl#92.

This dashboard is part of a wider Observability initiative within RabbitMQ. All code currently lives under https://github.com/rabbitmq/rabbitmq-prometheus/tree/master/docker. To get it all up and running, check the Makefile in the parent directory; the up target spins everything up locally.

cc @michaelklishin

essen commented Jun 11, 2019

It appears that the value is not the number of messages as previously thought but the number of bytes to be sent. Indeed it corresponds to the +zdbbl limit which is in (kilo)bytes.
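
Since the value is in bytes and +zdbbl is given in kilobytes, comparing the two needs a unit conversion. The following is only a rough sketch of that arithmetic, not part of the patch: the function names are hypothetical, and it assumes that one kilobyte means 1024 bytes here.

```python
# Hypothetical helpers; only the +zdbbl flag and its kilobyte unit come from
# the discussion above.
BYTES_PER_KILOBYTE = 1024  # assumption: 1 KB is treated as 1024 bytes


def zdbbl_limit_bytes(zdbbl_kilobytes):
    """Convert a +zdbbl setting (kilobytes) into bytes."""
    if zdbbl_kilobytes <= 0:
        raise ValueError("zdbbl must be a positive number of kilobytes")
    return zdbbl_kilobytes * BYTES_PER_KILOBYTE


def buffer_utilization(pend_bytes, zdbbl_kilobytes):
    """Return pending bytes as a fraction of the busy limit (not clamped)."""
    if pend_bytes < 0:
        raise ValueError("pend_bytes cannot be negative")
    return pend_bytes / zdbbl_limit_bytes(zdbbl_kilobytes)


if __name__ == "__main__":
    # Made-up sample: 64000 KB pending against a 128000 KB limit.
    print(buffer_utilization(64000 * 1024, 128000))
```

The ratio is left unclamped so that a caller can decide how to treat readings at or above the limit.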

gerhard commented Jun 11, 2019

We have deployed a 3-node Erlang cluster with this patch applied and wired everything together. This is what the end result looks like (notice the "Data buffered in the distribution links queue" panel):

image

These are all the relevant beam.smp flags that we are using:

/erlang-22.0.2.2207/lib/erlang/erts-10.4.1/bin/beam.smp
  -W w
  -A 256 
  -MBas ageffcbf
  -MHas ageffcbf
  -MBlmbcs 512
  -MHlmbcs 512
  -MMmcs 30
  -P 1048576
  -t 5000000
  -stbt db
  -zdbbl 128000
  -K true
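
To relate the queue metric to the configured limit, a monitoring script might read the zdbbl value back out of a command line like the one above. This is a minimal sketch under stated assumptions: parse_emulator_flags is a hypothetical helper, not an Erlang or RabbitMQ tool, and it assumes every flag takes at most one value and that no value starts with "-" or "+".

```python
import shlex


def parse_emulator_flags(command_line):
    """Map flag names (without the leading -/+) to their single value.

    Simplification: a flag is any token starting with '-' or '+', and its
    value is the next token unless that token also looks like a flag.
    Values that start with '-' or '+' are therefore not supported.
    """
    tokens = shlex.split(command_line)
    flags = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if len(token) > 1 and token[0] in "-+":
            has_value = i + 1 < len(tokens) and tokens[i + 1][0] not in "-+"
            flags[token[1:]] = tokens[i + 1] if has_value else ""
            i += 2 if has_value else 1
        else:
            i += 1  # skip non-flag tokens such as the beam.smp path
    return flags


if __name__ == "__main__":
    # Shortened, illustrative command line; the path is a placeholder.
    sample = "/path/to/beam.smp -W w -A 256 -zdbbl 128000 -K true"
    print(parse_emulator_flags(sample)["zdbbl"])
```

The parser returns values as strings; converting zdbbl to an integer is left to the caller.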

We are running on Linux 4.15.0-50-generic 16.04.1-Ubuntu SMP x86_64

Is there anything else that you need from us before merging & cutting a new OTP release with this in?

gerhard added a commit to rabbitmq/rabbitmq-prometheus that referenced this pull request Jun 11, 2019
gerhard added a commit to rabbitmq/rabbitmq-server-boshrelease that referenced this pull request Jun 17, 2019
This was done so that we can validate erlang/otp#2270
@garazdawi garazdawi merged commit a9373f7 into erlang:maint Jun 18, 2019
2 checks passed
continuous-integration/travis-ci/pr: The Travis CI build passed
license/cla: Contributor License Agreement is signed.

gerhard commented Jun 18, 2019

:shipit:

garazdawi commented Jun 18, 2019

We will leave this function undocumented (and thus subject to change without prior notice). This is because changes that we may make in the future could make it impossible (or very expensive) to answer the question of how much data is in the queue.

michaelklishin commented Jun 18, 2019

@garazdawi I think that observability should be an important design aspect. It would be very useful to have this metric or something that can be a reasonably close substitute.

gerhard commented Jun 18, 2019

As long as we can capture how much of the distribution buffer busy limit (zdbbl) is being used, we honestly don't care how it's done under the hood.

The end goal is to quantify how busy a particular distribution link is, and to know when it is becoming a bottleneck in an Erlang cluster.

Can you suggest a better/different way of going about it?

This is what this patch gives us, which is pretty good:

image
