Return real pend value in erlang:dist_get_stat/1 #2270

essen · 2019-06-03T15:04:26Z

Only the dist_util code is using this function and it already
is compatible with a non-boolean value.

We are interested in using this value as a metric to know
how large the distribution output queue is when we encounter
distribution-related issues.

/cc @gerhard

gerhard · 2019-06-03T16:01:49Z

This would be a great addition. We have already found the perfect space for this metric, right next to State of distribution links:

AndrewDryga · 2019-06-04T11:57:14Z

@gerhard is this your own tailor-made dashboard or a shared one?

gerhard · 2019-06-04T13:37:08Z

It's a dashboard that we - the RabbitMQ team - plan on sharing, most likely via grafana.com. It is not RabbitMQ-specific; it will work with any Erlang cluster that runs prometheus.erl. The Erlang part was done in deadtrickster/prometheus.erl#92

This dashboard fits under a wider Observability initiative within RabbitMQ. All code currently lives under https://github.com/rabbitmq/rabbitmq-prometheus/tree/master/docker. To get it all up and running, check the Makefile in the parent dir; the up target is what spins everything up locally.

cc @michaelklishin

essen · 2019-06-11T15:17:44Z

It appears that the value is not the number of messages, as previously thought, but the number of bytes to be sent. Indeed, it corresponds to the +zdbbl limit, which is in (kilo)bytes.

gerhard · 2019-06-11T17:52:39Z

We have deployed a 3-node Erlang cluster with this patch applied and wired everything together. This is what the end result looks like (notice the "Data buffered in the distribution links queue" panel):

These are all the relevant beam.smp flags that we are using:

/erlang-22.0.2.2207/lib/erlang/erts-10.4.1/bin/beam.smp
  -W w
  -A 256 
  -MBas ageffcbf
  -MHas ageffcbf
  -MBlmbcs 512
  -MHlmbcs 512
  -MMmcs 30
  -P 1048576
  -t 5000000
  -stbt db
  -zdbbl 128000
  -K true

We are running on Linux 4.15.0-50-generic 16.04.1-Ubuntu SMP x86_64

Is there anything else that you need from us before merging & cutting a new OTP release with this in?

gerhard · 2019-06-18T13:25:47Z

garazdawi · 2019-06-18T13:30:58Z

We will leave this function undocumented (and thus subject to change without prior notice). This is because changes that we may make in the future could make it impossible (or very expensive) to answer the question of how much data is in the queue.

michaelklishin · 2019-06-18T13:38:12Z

@garazdawi I think that observability should be an important design aspect. It would be very useful to have this metric or something that can be a reasonably close substitute.

gerhard · 2019-06-18T13:54:30Z

As long as we can capture the busy dist limit buffer - how much of zdbbl is being used - we honestly don't care how it's done under the hood.

The end-goal is to quantify how busy a particular distribution link is, and to know when this is becoming a bottleneck in an Erlang cluster.
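To make "how much of zdbbl is being used" concrete, here is a minimal sketch of the arithmetic. It assumes the pend value is a byte count (as noted earlier in this thread) and that the +zdbbl limit is given in kilobytes of 1024 bytes each. The function name `zdbbl_utilization` and its parameters are hypothetical and exist only for illustration; this is not code from OTP or prometheus.erl.

```python
def zdbbl_utilization(pending_bytes: int, zdbbl_kilobytes: int) -> float:
    """Return the fraction of the distribution buffer busy limit in use.

    Assumptions (not confirmed by this thread):
    - pending_bytes is the pend value, expressed in bytes
    - zdbbl_kilobytes is the +zdbbl setting, where 1 kilobyte = 1024 bytes
    """
    if zdbbl_kilobytes <= 0:
        raise ValueError("zdbbl_kilobytes must be positive")
    if pending_bytes < 0:
        raise ValueError("pending_bytes must not be negative")
    limit_bytes = zdbbl_kilobytes * 1024
    return pending_bytes / limit_bytes


# With the -zdbbl 128000 flag listed earlier in this thread, the limit is
# 128000 * 1024 = 131072000 bytes, so 65536000 pending bytes is half of it.
half_full = zdbbl_utilization(65_536_000, 128_000)
```

A value approaching 1.0 would suggest that the link is close to its busy limit, which is the bottleneck signal described above.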

Can you suggest a better/different way of going about it?

This is what this patch gives us, which is pretty good:

@essen

Add node_queue_size_bytes metric to dist collector. erlang/otp#2270 has been available since Erlang/OTP 22.1, released on 17 September 2019, so it is time to ship this feature 🚢 Thanks @essen!

Return real pend value in erlang:dist_get_stat/1

3028401

Only the dist_util code is using this function and it already is compatible with a non-boolean value.

essen changed the base branch from master to maint June 3, 2019 15:04

essen mentioned this pull request Jun 3, 2019

alt_dist docs: Correct the mf_getstat description #2271

Merged

rickard-green added the team:VM Assigned to OTP team VM label Jun 10, 2019

essen mentioned this pull request Jun 11, 2019

Add node_queue_size metric to dist collector deadtrickster/prometheus.erl#94

Merged

fixup! Return real pend value in erlang:dist_get_stat/1

6d337a2

gerhard added a commit to rabbitmq/rabbitmq-prometheus that referenced this pull request Jun 11, 2019

Respond to learnings from a LRE PromStack & Erlang Distribution metrics

3598810

re deadtrickster/prometheus.erl#94 re erlang/otp#2270 [#166574772]

gerhard added a commit to rabbitmq/rabbitmq-server-boshrelease that referenced this pull request Jun 17, 2019

Add a patched version of OTP 22.0.2

0b0f0fd

This was done so that we can validate erlang/otp#2270

garazdawi merged commit a9373f7 into erlang:maint Jun 18, 2019

gerhard mentioned this pull request Apr 27, 2020

Metrics missing for distribution buffer busy limit - zdbbl rabbitmq/rabbitmq-prometheus#39

Closed
