Prometheus integration #96

kevinbader · 2018-10-11T07:26:20Z

Should be fairly easy for Phoenix request metrics, for example see https://blog.smartlogic.io/monitoring-your-elixir-app-with-prometheus/

Ideas for custom metrics:

number of sessions
number of subscriptions (per connection)
...

EDIT: could be implemented using https://github.com/shinyscorpion/wObserver

jakobwgnr · 2018-11-25T16:24:47Z

Tried implementing using prometheus_ex + implementing the standard PhoenixInstrumenter.

Nevertheless I'm not 100% sure if the PhoenixInstrumenter makes sense as there are two separate webservers within rig and I assume that that this standard library is not able to fetch Information for both of them nor to differentiate between them.

considering #102 as the new application architecture, the Metrics endpoint is planned to be exposed via :rig_control

I'm not sure if to implement within that application in /lib/Metrics (at least they do so in https://github.com/oestrich/ex_venture/lib/Metrics)
Anyways, their Project Setup is a bit different & we'd need to monitor metrics from other applications. Hence a dependency to :rig_control would be required… Therefore maybe a seperate application makes sense.

Deciding between Prometheus_ex & wObserver, at least for Prometheus metrics I'd go for Prometheus_ex, as implementing custom metrics seems to look more straight forward & cleaner from a Code perspective

wObserver nevertheless provides this fancy frontend showing these standard observer Infos, which might be helpful…
Maybe implementing both (Prometheus_ex for custom metrics & wobserver for standard info's) is also an option.

@kevinbader: please provide feedback on my thoughts and how you'd like to have it implemented, I'd like to support on this :)

kevinbader · 2018-11-25T17:37:46Z

@vanillebear nice appreciate the support :)

I think you're right, there is a dependency issue like this. Let me summarize:

The /metrics endpoint is exposed at the endpoint running in :rig_control, using Prometheus.PlugExporter. The PlugExporter's setup/0 call is done here (which sets up the ETS tables).
The instrumeter modules (Prometheus.Metric) reside next to what they're measuring (i.e., in :rig_sessions, :rig_event_hub, :rig_proxy).
As soon as one of the instrumenters is invoked, RIG will crash unless :rig_control has been started first (due to missing ETS tables).

The only way to guarantee that :rig_control is started first is to make it a depedency of the instrumented apps, which causes a cyclic dependency (which is what you've meant, I assume). To prevent this, we need a dedicated Prometheus OTP app that is started first. The PlugExporter can still be used in :rig_control, but the setup call has to be moved there. I'll update the diagram in #102 accordingly...

jakobwgnr · 2018-12-04T15:24:44Z

Assuming that the current implementation will be refactored (e.g. rig_api -> rig_control) still I hope what I'm doing on the current Project structure, is still valid in a future layout :)

Currently working on the implementation - The initial Setup of the new OTP app is done (including updates in dockerfile, etc.) + Exporting the endpoint via rig_api.

Current app structure:
rig_metrics

lib
-- application.ex
-- metrics_exporter.ex
-- proxy_instrumenter.ex
-- eventhub_instrumenter.ex

I'd propose following metrics for the ProxyInstrumenter as @kevinbader suggested.
rig_sessions_total --> As Prometheus Counter
rig_sessions_current --> As Prometheus Gauge

rig_subscriptions_total --> As Prometheus Counter + Label ConnectionID?!
rig_subscriptions_current --> As Prometheus Gauge + Label ConnectionID?!
Is there something like the Connection ID we could use for the label?

rig_request_size_bytes -> As Prometheus Summary
rig_response_size_bytes -> As Prometheus Summary

Currently no idea for metrics on the Event hub, but in general we could use labels to identify the type (e.g. Kafka)

I'm not 100% sure on were to track the metrics as I'm not 100% aware of the RIG implementation in general. Could someone propose were to implement?

I didn't provide a PR yet as still work in progress. Current version can be found in https://github.com/vanillebear/reactive-interaction-gateway/tree/feature/96-promotheus-integration

kevinbader · 2018-12-04T20:24:49Z

Nice work! Some thoughts:

rig_sessions_total --> As Prometheus Counter
rig_subscriptions_total --> As Prometheus Counter + Label ConnectionID?!

I wonder if/how they could be useful..

rig_subscriptions_current --> As Prometheus Gauge + Label ConnectionID?!
Is there something like the Connection ID we could use for the label?
...
Currently no idea for metrics on the Event hub, but in general we could use labels to identify the type (e.g. Kafka)

For certain proxy features we build a "correlation ID" that is basically the encoded process ID; the same mechanism could be used to create a "connection ID" for a given websocket or SSE connection. What would you use the label for?

For each event type there is one subscription table, so having a Gauge per event type should be as easy as reporting the number of rows in the respective (ETS) table.

The number of subscriptions per connection can be calculated by grouping the rows by pid, but I guess this will get a bit slow with the number of subscriptions: as one process (pid = process id) may have multiple subscriptions in the same table, the pid column is not unique => no index, slow counting. And if we'd want to report the number of subscriptions regardless of event types, we'd even have to count this for each table (again, one table per event type). So perhaps we'll skip that metric.

Other ideas for metrics:

number of open (inbound HTTP) proxy connections
size of blacklist, i.e. how many sessions are currently blacklisted
number of routes configured
number of requests against the proxy (is that what the Prometheus Summary is for?)
number of processed events on that node (e.g., events per second) and how many of them have been forwarded vs. how many have been dropped
number of websocket connections vs number of SSE connections

I'm not 100% sure on were to track the metrics as I'm not 100% aware of the RIG implementation in general. Could someone propose were to implement?

I think the instrumentation modules should live next to what they're measuring: if the measured thing changes, the corresponding metrics might change as well, so it keeps the changes local. That said, I would throw everything into rig_metrics for now and move the instrumentation modules after the umbrella refactoring.

On a different matter, I'd prefer having a health endpoint that returns OK instead of querying the APIs endpoint - in other words I think the docs are good, it's just the implementation that is missing (which doesn't need to be in the same PR, of course).

Do you already have any plans for how to create the dashboard? Not that we'd strictly need one, but it would be pretty neat. Grafana seems to supported "scripted dashboards", but I haven't looked into alternatives yet (are there any?).

jakobwgnr · 2018-12-05T11:06:08Z

I wonder if/how they could be useful.

I thought that this could show the overall usage over time within grafana, but actually you'll see the same Information when looking on the current usage over time. So skipping that.

For certain proxy features we build a "correlation ID" that is basically the encoded process ID; the same mechanism could be used to create a "connection ID" for a given websocket or SSE connection. What would you use the label for?

The label would be used to be able to have a detailed view per Connection ID within grafana. If that's not really relevant we could skip that

The number of subscriptions per connection can be calculated by grouping the rows by pid, but I guess this will get a bit slow with the number of subscriptions: as one process (pid = process id) may have multiple subscriptions in the same table, the pid column is not unique => no index, slow counting. And if we'd want to report the number of subscriptions regardless of event types, we'd even have to count this for each table (again, one table per event type). So perhaps we'll skip that metric.

Actually the metrics will be increased/decreased when an subscription is added/removed by calling the respective function of the instrumenter. e.g.: by calling Metrics.ProxyInstrumenter.add_subscription The current count will be held by the instrumenter itself.

I think the instrumentation modules should live next to what they're measuring: if the measured thing changes, the corresponding metrics might change as well, so it keeps the changes local. That said, I would throw everything into rig_metrics for now and move the instrumentation modules after the umbrella refactoring.

Hmm, then we would need an dependency to ex_prometheus in every module as we need to implement the Prometheus.Metric module.
I'd see the Instrumenter itself is relatively static and won't really change once it's implemented (otherwise possible grafana dashboards could break if we change names, etc.) Just were to call the respective Instrumenter function might change. So I'd go for everything in rig_metrics for now.

On a different matter, I'd prefer having a health endpoint that returns OK instead of querying the APIs endpoint - in other words I think the docs are good, it's just the implementation that is missing (which doesn't need to be in the same PR, of course).

Understood - I'll undo the change & create an issue for that.

Do you already have any plans for how to create the dashboard? Not that we'd strictly need one, but it would be pretty neat. Grafana seems to supported "scripted dashboards", but I haven't looked into alternatives yet (are there any?).

There are a few examples for beam/elixir grafana Dashboards provided here
We could use that as a basis and modify for rig

Metrics planned as of latest discussion:

RigProxyInstrumenter

rig_current_session_count --> As Prometheus Gauge
rig_current_subscription_count --> As Prometheus Gauge
rig_current_open_proxy_connection_count --> As Prometheus Gauge
rig_proxy_requests_total --> As Prometheus Counter

RigEventhubInstrumenter

rig_processed_events_total --> As Prometheus counter + Label Node
rig_forwarded_events_total --> As Prometheus counter + Label Node
rig_dropped_events_total --> As Prometheus counter + Label Node

RigControlInstrumenter

rig_current_session_blacklisted_count --> As Prometheus Gauge
rig_current_routes_configured_count --> As Prometheus Gauge

kevinbader · 2018-12-05T21:55:30Z

Actually the metrics will be increased/decreased when an subscription is added/removed by calling the respective function of the instrumenter. e.g.: by calling Metrics.ProxyInstrumenter.add_subscription The current count will be held by the instrumenter itself.

Yeah you're right. I assume that when the connection goes down the counter (Gauge) is decreased again. But what if the connection process crashes? Perhaps we might just assume that this doesn't happen too often and should be ignored, but the measurement won't be reliable then. Unfortunately, Erlang monitors cannot be used here as they don't perform well when creating many processes in a short period of time (at least according to my initial benchmarks).

Hmm, then we would need an dependency to ex_prometheus in every module as we need to implement the Prometheus.Metric module.
I'd see the Instrumenter itself is relatively static and won't really change once it's implemented (otherwise possible grafana dashboards could break if we change names, etc.) Just were to call the respective Instrumenter function might change. So I'd go for everything in rig_metrics for now.

Sure, I think I've mixed that up as well. The instrumeters don't change and the code that actually invokes the instrumenters lives along with what they measure, right? Seems fine to me that way.

Understood - I'll undo the change & create an issue for that.

Thanks!

There are a few examples for beam/elixir grafana Dashboards provided here
We could use that as a basis and modify for rig

really cool, didn't know about this.

Might be details, but wouldn't it make more sense to have rig_session_gauge instead of rig_current_session_count (same for subscriptions)? And perhaps rig_current_open_proxy_connection_count could be called rig_proxy_connection_gauge and rig_proxy_requests_total renamed to rig_proxy_connection_counter. What do you think?

jakobwgnr · 2018-12-05T22:04:39Z

Thanks for your Feedback.

Regarding naming, I think i missunderstood the Prometheus naming conventions & Counter/Gauge/etc. metrics types are Prometheus core concepts so should be known to everyone who wants to use them.

I'll rename!

kevinbader · 2018-12-05T22:16:32Z

Just an idea, seems to make sense to me.. However, I don't think you've "misunderstood" anything - indeed, the official naming conventions don't take the same approach. The docs have examples where Counters are post-fixed with _count and Gauges are in stated with nouns and verbs; for example: rig_proxy_requests_total and rig_proxy_clients_connected. Looks nice as well.

btw, quite interesting: https://promcon.io/2017-munich/slides/best-practices-and-beastly-pitfalls.pdf
Note the "unbounded label values will blow up Prometheus" - suddenly using connection IDs doesn't seem like a good idea anymore :D

jakobwgnr · 2018-12-09T21:40:05Z

Hi @kevinbader I have problems with writing tests for this… e.g. for blacklisted sessions they are cleaned up by a separate process (within :rig) and i actually don't know how to manually trigger this within :rig_tests…
I have tested this (by using :timer.sleep())& it's not running automatically.

Do you have an idea how to do this? Or is ok to have not that high test coverage for this?

kevinbader · 2018-12-10T08:35:48Z

@vanillebear does this help? https://github.com/Accenture/reactive-interaction-gateway/blob/master/apps/rig_auth/test/blacklist_test.exs#L37

jakobwgnr · 2018-12-10T14:34:50Z

@kevinbader Unfortunately not... The current Metrics handling is implemented in Distributed_set.ex within "remove_expired_records" not in Rig_Auth

I tried using this, but still the remove_expired_records isn't called?
{:ok, _pid} = DistributedSet.start_link(MySet, name: MySet)
:ok = DistributedSet.add(MySet, "SomeSessionId_1", 1)
false = DistributedSet.has?(MySet, "foo", shift_time_s: 2)
:ok = GenServer.stop(MySet)

Sorry for this… Unfortunately I'm not quite sure/can't find how this method is called in general so I'm not sure how to test :/

Maybe also the wrong place to track the metrics?

EDIT: I think I'll move the implementation to Blacklist.ex

kevinbader · 2018-12-10T16:07:47Z

In general I think DistributedSet feels like the wrong place to track metrics, as it's basically a data structure implementation and not related to any (business) functionality.

Thinking out loud, an integration test would probably look somewhat like this:

blacklist something using RIG's API and set ttl_s in the request to 1.
check that the metrics show one session blacklisted.
wait for 1 second. <-- sleeping in tests is so ugly... perhaps we find something better than this. For unit tests, Blacklist already allows one to subscribe to the expiration of a blacklist entry. Perhaps we should also provide a way to subscribe to any blacklist-related events (add, timeout) with respect to a certain session.

So what about this then instead:

Subscribe to the blacklist given a certain session ID (= JWT jti).
Blacklist that jti using RIG's API and set ttl_s in the request to 1.
Assert that the metrics show 1 session blacklisted.
Wait at most 2 seconds for the expiration event, throw otherwise.
Assert that the metrics show 0 session blacklisted.

jakobwgnr · 2019-02-05T12:36:22Z

Hello @kevinbader, @mmacai:
short summary on this as discussed.

We aligned that this issue should be used for initial setup of prometheus exporting the /metrics on rig_api

I'll create an separate issue for implementation of rig specific metrics.

Also for the problematic "Blacklist" implementation I've created an separate issue --> #156 just not to forget

kevinbader · 2019-02-05T12:48:19Z

awesome, thanks!

kevinbader added the enhancement label Oct 11, 2018

kevinbader added this to the 2.1.0 milestone Oct 11, 2018

kevinbader modified the milestones: 2.1.0 a.k.a. later, 2.0.0-beta.4 Nov 6, 2018

kevinbader mentioned this issue Nov 25, 2018

Time for a new umbrella #102

Closed

kevinbader modified the milestones: 2.0.0-beta.4, 2.2.0 Nov 26, 2018

kevinbader assigned kevinbader and mmacai Feb 5, 2019

kevinbader modified the milestones: 2.1.0, 2.0.3 Feb 5, 2019

jakobwgnr mentioned this issue Feb 5, 2019

Unclear/Broken Blacklist implementation #156

Closed

jakobwgnr mentioned this issue Feb 5, 2019

Feature 96 "initial" promotheus integration #158

Merged

kevinbader closed this as completed in #158 Feb 15, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prometheus integration #96

Prometheus integration #96

kevinbader commented Oct 11, 2018 •

edited

jakobwgnr commented Nov 25, 2018

kevinbader commented Nov 25, 2018

jakobwgnr commented Dec 4, 2018

kevinbader commented Dec 4, 2018

jakobwgnr commented Dec 5, 2018 •

edited

kevinbader commented Dec 5, 2018

jakobwgnr commented Dec 5, 2018

kevinbader commented Dec 5, 2018 •

edited

jakobwgnr commented Dec 9, 2018

kevinbader commented Dec 10, 2018

jakobwgnr commented Dec 10, 2018 •

edited

kevinbader commented Dec 10, 2018

jakobwgnr commented Feb 5, 2019

kevinbader commented Feb 5, 2019

Prometheus integration #96

Prometheus integration #96

Comments

kevinbader commented Oct 11, 2018 • edited

jakobwgnr commented Nov 25, 2018

kevinbader commented Nov 25, 2018

jakobwgnr commented Dec 4, 2018

kevinbader commented Dec 4, 2018

jakobwgnr commented Dec 5, 2018 • edited

kevinbader commented Dec 5, 2018

jakobwgnr commented Dec 5, 2018

kevinbader commented Dec 5, 2018 • edited

jakobwgnr commented Dec 9, 2018

kevinbader commented Dec 10, 2018

jakobwgnr commented Dec 10, 2018 • edited

kevinbader commented Dec 10, 2018

jakobwgnr commented Feb 5, 2019

kevinbader commented Feb 5, 2019

kevinbader commented Oct 11, 2018 •

edited

jakobwgnr commented Dec 5, 2018 •

edited

kevinbader commented Dec 5, 2018 •

edited

jakobwgnr commented Dec 10, 2018 •

edited