Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prometheus endpoint #7900

Merged
merged 14 commits into from
Dec 15, 2019
Merged

Conversation

vdimir
Copy link
Member

@vdimir vdimir commented Nov 24, 2019

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry:

  • Add Prometheus support.

Detailed description (optional):

ClickHouse serves metrics in Prometheus format on specified endpoint.
Close #7369

@vdimir
Copy link
Member Author

vdimir commented Nov 24, 2019

Initial implementation very basic, and I want to discuss some question.

Prometheus support Histogram and Summary metric types.
Should we use one of them (e.g. for CurrentMetrics)?
It may require collecting and analyzing metrics in background (now it scrapped on demand when Prometheus client request for it) for calculating buckets or quantiles.

I make changes in HttpHandlerFactory to support different handles (Prometheus one of them).
But implementation may seems over-complicated, we may want to make some changes in it, or revert some changes and use different approach.

Should AsyncMetrics be added to Prometheus ad in Graphite exporter?
Or some data from system.parts, e.g. disk usage for each table or number of rows ?

New PrometheusMetricsWriter class contains some common parts with MetricsTransiter.
Do we need to merge entities that aggregate metrics into one or several with common parts?

Does test required for new functionality? I think that implementing integration test possible, but we need to decide what exactly we want to check.

@vdimir
Copy link
Member Author

vdimir commented Nov 24, 2019

Also I add metric names in snake case following prometheus guidelines. I wrote names directly in source code, because some metrics should contain units suffix (or _total) and some names needs to be changed to plural form.

@zhang2014
Copy link
Contributor

Can it be implemented through a configuration of this #7572 ?

@abyss7 abyss7 added the pr-feature Pull request with new product feature label Nov 25, 2019
@abyss7 abyss7 self-assigned this Nov 25, 2019
@abyss7
Copy link
Contributor

abyss7 commented Nov 25, 2019

Should AsyncMetrics be added to Prometheus ad in Graphite exporter?
Or some data from system.parts, e.g. disk usage for each table or number of rows ?

For now we suggest that only metrics and events contain all that we need.

@alexey-milovidov
Copy link
Member

Initial implementation very basic, and I want to discuss some question.

The implementation looks very sound.

Prometheus support Histogram and Summary metric types.
Should we use one of them (e.g. for CurrentMetrics)?
It may require collecting and analyzing metrics in background (now it scrapped on demand when Prometheus client request for it) for calculating buckets or quantiles.

Better to finish this PR and postpone implementation of derived types of metrics.

Should AsyncMetrics be added to Prometheus ad in Graphite exporter?

AsyncMetrics should be added exactly in the same way as for Graphite - for consistency.

Does test required for new functionality? I think that implementing integration test possible, but we need to decide what exactly we want to check.

Yes, we need a test. A test will check that the server responds for prometheus handle and the response has some expected value of some metric (not necessarily to check for exact value, just check the presence of some metric and that value is greater than something).

@alexey-milovidov
Copy link
Member

alexey-milovidov commented Nov 26, 2019

@zhang2014

Can it be implemented through a configuration of this #7572 ?

Yes, this feature along with Template format looks powerful enough to implement something like Prometheus handler without any specialized code. But I'm not sure. Some custom logic like converting names to snake case or calculating deltas from previous values still require additional logic. Let's try to do it as an example. PS. I think, that custom code is Ok

@alexey-milovidov
Copy link
Member

alexey-milovidov commented Nov 26, 2019

@abyss7

For now we suggest that only metrics and events contain all that we need.

AsynchronousMetrics contain valuable information for monitoring. If we don't add them, then Prometheous support will be inferior than Graphite support.

@alex-zaitsev
Copy link
Contributor

@vdimir , your implementation hardcodes metric names and description. It would require to remember adding new metrics in monitoring when something is added in ClickHouse. Outside of ClickHouse implementations (https://github.com/f1yegor/clickhouse_exporter/ or inside https://github.com/Altinity/clickhouse-operator) try to create metrics on the fly from system tables. Could you elaborate why you decided to hardcode, please?

@filimonov
Copy link
Contributor

Outside of ClickHouse implementations (https://github.com/f1yegor/clickhouse_exporter/ or inside https://github.com/Altinity/clickhouse-operator) try to create metrics on the fly from system tables.

Yes. One line instead introducing extra param seems much simpler:

Also, both exporters provide results of some extra queries as metrics (so the people migrating from those exporters will have less metrics). Maybe we can consider adding them (or similar) too:

@alexey-milovidov
Copy link
Member

alexey-milovidov commented Nov 26, 2019

@filimonov

Also, both exporters provide results of some extra queries as metrics (so the people migrating from those exporters will have less metrics). Maybe we can consider adding them (or similar)

This should be implemented in a separate PR. The goal of this PR is to add Prometheus handler to provide the same set of metrics that we already support for Graphite.

@vdimir
Copy link
Member Author

vdimir commented Nov 26, 2019

@vdimir , your implementation hardcodes metric names and description. It would require to remember adding new metrics in monitoring when something is added in ClickHouse. Outside of ClickHouse implementations (https://github.com/f1yegor/clickhouse_exporter/ or inside https://github.com/Altinity/clickhouse-operator) try to create metrics on the fly from system tables. Could you elaborate why you decided to hardcode, please?

I add snake case names because some of them needs to be changed in plural form, or some suffix should be added. (my comment above).

I considered automatic conversion to snake case too, but I haven't decide what is better.
Ok, I'll change it.

@zhang2014
Copy link
Contributor

zhang2014 commented Nov 27, 2019

Yes, this feature along with Template format looks powerful enough to implement something like Prometheus handler without any specialized code. But I'm not sure. Some custom logic like converting names to snake case or calculating deltas from previous values still require additional logic. Let's try to do it as an example. PS. I think, that custom code is Ok

Custom code is fine. I'm just offering another way:

  • names to snake case: It maybe a function?
  • calculating deltas from previous values still require additional logic: Maybe metrics, events and AsynchronousMetrics can add some columns to provide the last two measurements? like uptime, this is useful for some other monitoring systems.

It looks like a good contribution anyway.

@alexey-milovidov
Copy link
Member

calculating deltas from previous values still require additional logic: Maybe metrics, events and AsynchronousMetrics can add some columns to provide the last two measurements? like uptime, this is useful for some other monitoring systems.

Three points:

  1. We already have system.metric_log table.
  2. The user decide for which period to accumulate the values (one second, minute, etc) - probably we should not hardcode a few such periods in tables.
  3. There are many derived metrics possible. The most popular are: exponential smoothed averages; quantiles of a set of measurements...

@zhang2014
Copy link
Contributor

zhang2014 commented Nov 27, 2019

The following configuration currently work in #7572

config.xml:

<predefine_query_handler>
    <url>/metrics</url>
    <method>GET</method>
    <queries>
        <query>SELECT * FROM system.metrics LIMIT 5 FORMAT Template SETTINGS format_template_resultset = 'prometheus_template_output_format_resultset', format_template_row = 'prometheus_template_output_format_row', format_template_rows_between_delimiter = '\n'</query>
    </queries>
</predefine_query_handler>

prometheus_template_output_format_resultset:

${data:None}

prometheus_template_output_format_row:

# HELP ${metric:CSV} ${description:CSV}
# TYPE ${metric:CSV} counter
${metric:CSV} ${value:CSV}

Result:

curl -vvv 'http://localhost:8123/metrics'
*   Trying ::1...
* Connected to localhost (::1) port 8123 (#0)
> GET /metrics HTTP/1.1
> Host: localhost:8123
> User-Agent: curl/7.47.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Wed, 27 Nov 2019 08:54:25 GMT
< Connection: Keep-Alive
< Content-Type: text/plain; charset=UTF-8
< X-ClickHouse-Server-Display-Name: i-tl62qd0o
< Transfer-Encoding: chunked
< X-ClickHouse-Query-Id: f39235f6-6ed7-488c-ae07-c7ceafb960f6
< Keep-Alive: timeout=3
< X-ClickHouse-Summary: {"read_rows":"0","read_bytes":"0","written_rows":"0","written_bytes":"0","total_rows_to_read":"0"}
< 
# HELP "Query" "Number of executing queries"
# TYPE "Query" counter
"Query" 1

# HELP "Merge" "Number of executing background merges"
# TYPE "Merge" counter
"Merge" 0

# HELP "PartMutation" "Number of mutations (ALTER DELETE/UPDATE)"
# TYPE "PartMutation" counter
"PartMutation" 0

# HELP "ReplicatedFetch" "Number of data parts being fetched from replica"
# TYPE "ReplicatedFetch" counter
"ReplicatedFetch" 0

# HELP "ReplicatedSend" "Number of data parts being sent to replicas"
# TYPE "ReplicatedSend" counter
"ReplicatedSend" 0

* Connection #0 to host localhost left intact

@zhang2014
Copy link
Contributor

We already have system.metric_log table.
The user decide for which period to accumulate the values (one second, minute, etc) - probably we should not hardcode a few such periods in tables.
There are many derived metrics possible. The most popular are: exponential smoothed averages; quantiles of a set of measurements...

Can we add AsynchronousMetrics to the system.metric_log ?

@alexey-milovidov
Copy link
Member

Can we add AsynchronousMetrics to the system.metric_log ?

We have not added it, because they are updated in longer intervals (60 seconds by default) in contrast to other values, that are always up to date and can be written as frequent as needed.

@alexey-milovidov
Copy link
Member

Can we avoid duplicating metric names in snake_case?
snake_case is just a guideline from Prometheus, not a strict requirement.

@alexey-milovidov alexey-milovidov added the st-need-info We need extra data to continue (waiting for response) label Dec 3, 2019
@alexey-milovidov
Copy link
Member

Is it acceptable to keep metric names in CamelCase without any changes? To keep the code more maintenable.

@vdimir
Copy link
Member Author

vdimir commented Dec 10, 2019

Is it acceptable to keep metric names in CamelCase without any changes? To keep the code more maintenable.

I have not enough experience with Prometheus to decide which solution more acceptable. But now I reverted names to CamelCase and ready to implement other approach for naming if we'll find it.

Other uncompleted part of this task is adding integration test. I have some issue with running it on MacOS, because I couldn't build ClickHouse natively and use docker with Ubuntu. But integration tests require docker itself which is not available because I already in container. I am interested to figure out machinery with integration tests and finish this task, so I hope that I find some solution for it.

@azat
Copy link
Collaborator

azat commented Dec 15, 2019

Is it acceptable to keep metric names in CamelCase without any changes?

Another reason for keeping the metric names - it will allow use various dashboards without any changes

@alexey-milovidov alexey-milovidov merged commit d498e14 into ClickHouse:master Dec 15, 2019
@alexey-milovidov
Copy link
Member

alexey-milovidov commented Dec 15, 2019

About the choice of default port in configuration example, 8001.
Is it reasonable?

We should default to some most widely used port for that type of handles or simply reuse http_port.

@alexey-milovidov
Copy link
Member

2019.12.16 01:47:49.789705 [ 1 ] {} <Information> Application: Listening http://[::1]:8001
not informative enough.

@alexey-milovidov
Copy link
Member

But works perfectly.

alexey-milovidov added a commit that referenced this pull request Dec 15, 2019
@vdimir
Copy link
Member Author

vdimir commented Dec 16, 2019

About the choice of default port in configuration example, 8001.
Is it reasonable?

We should default to some most widely used port for that type of handles or simply reuse http_port.

Common Prometheus port is 9090 (link), I had to use it in prometheus.port.
If prometheus section in config doesn't contain port then http_port would be used.

2019.12.16 01:47:49.789705 [ 1 ] {} <Information> Application: Listening http://[::1]:8001
not informative enough.

Ok, I'll make response more verbose and send other PR.
Also I'll change prometheus default settings port to 9090.

UPD: I didn't notice that you already change response text.

@azat
Copy link
Collaborator

azat commented Dec 16, 2019

Common Prometheus port is 9090

Hm, but this is prometheus server, not the exporter (while this is an exporter)

P.S. this is just a note, no objections against 9090 port

@vdimir
Copy link
Member Author

vdimir commented Dec 16, 2019

Hm, but this is prometheus server, not the exporter (while this is an exporter)

Thanks for note, I was wrong. Exporters uses ports starting at 9100.
https://prometheus.io/docs/instrumenting/writing_exporters/#port-numbers

Feel free to grab the next free port number when developing your exporter, preferably before publicly announcing it. If you’re not ready to release yet, putting your username and WIP is fine.

@alexey-milovidov
Copy link
Member

Ok. Waiting for a PR!

response more verbose

I already made it.

@azat
Copy link
Collaborator

azat commented Dec 16, 2019

Also there are problems with jemalloc statistics, there is . in it, i.e.:

ClickHouseAsyncMetricsjemalloc.background_thread.run_interval 0
ClickHouseAsyncMetricsjemalloc.background_thread.num_runs 0
ClickHouseAsyncMetricsjemalloc.background_thread.num_threads 0
ClickHouseAsyncMetricsjemalloc.metadata 66363096
ClickHouseAsyncMetricsjemalloc.active 3122552832
ClickHouseAsyncMetricsjemalloc.retained 2311983104
ClickHouseAsyncMetricsjemalloc.metadata_thp 0
ClickHouseAsyncMetricsjemalloc.resident 7604621312
ClickHouseAsyncMetricsjemalloc.mapped 8007053312
ClickHouseAsyncMetricsjemalloc.allocated 3066734960

@alexey-milovidov
Copy link
Member

Two suggestions:

  1. Let's add underscores after metric group: ClickHouseAsyncMetrics_
  2. Let's replace dots with underscores: jemalloc_background_thread_run_interval

@ut0mt8
Copy link

ut0mt8 commented May 15, 2020

Hello, this looks cool. Just few comments/questions : so we have a basic /metrics endpoints which answer so hardcoded queries and we can add some predefined_queries in http_handler conf.
Great but it would be better if we can add predefined_queries at the end of the same path /metrics endpoints (unless I miss something).
Also it seems it doesn't support prometheus labels , which is a must in the prom world.
Good work anyway.

@vdimir
Copy link
Member Author

vdimir commented May 15, 2020

Hello, this looks cool. Just few comments/questions : so we have a basic /metrics endpoints which answer so hardcoded queries and we can add some predefined_queries in http_handler conf.

Great but it would be better if we can add predefined_queries at the end of the same path /metrics endpoints (unless I miss something).

Do you mean same as #10043 ?

Also it seems it doesn't support prometheus labels , which is a must in the prom world.

Could you explain, what types of labels should be added and how it should be set?

@ut0mt8
Copy link

ut0mt8 commented May 15, 2020

In prometheus label can be anything.
In our queries we wants labels for fields in the group by clause.

For example :

select sum(somefield) as super_stat , pid from stats_table group by pid;

would result in :

super_stat{pid="somepid"} XX

So yep add more custom queries possible for the same htttp route / path (/metrics) and add the possibilities to have more custom output.
No idea of the grammar so far.

@vdimir vdimir deleted the prometheus-metrics-7369 branch May 16, 2020 07:20
@zhang2014
Copy link
Contributor

So yep add more custom queries possible for the same htttp route / path (/metrics) and add the possibilities to have more custom output.

@ut0mt8 Maybe HTTP handler config + template format will help you ?

reference:
#7900 (comment)

https://clickhouse.tech/docs/en/interfaces/http/#predefined_query_handler

https://clickhouse.tech/docs/en/interfaces/formats/#format-template

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-feature Pull request with new product feature st-need-info We need extra data to continue (waiting for response)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add native Prometheus metrics endpoint
8 participants