Refactoring prometheus module to aggregate metrics based on metric family #4075

Merged
merged 2 commits into from Apr 21, 2017

@vjsamuel (Contributor) commented Apr 21, 2017

The current Prometheus collector implementation emits each metric as a separate event. This is not how Prometheus data is meant to be read: Prometheus groups related samples under the concept of a MetricFamily.

Example of a metric family would be:

# TYPE apiserver_request_latencies histogram
apiserver_request_latencies_bucket{resource="accounts",verb="GET",le="125000"} 11542
apiserver_request_latencies_bucket{resource="accounts",verb="GET",le="250000"} 11543
apiserver_request_latencies_bucket{resource="accounts",verb="GET",le="500000"} 11543
apiserver_request_latencies_bucket{resource="accounts",verb="GET",le="1e+06"} 11543
apiserver_request_latencies_bucket{resource="accounts",verb="GET",le="2e+06"} 11543
apiserver_request_latencies_bucket{resource="accounts",verb="GET",le="4e+06"} 11543
apiserver_request_latencies_bucket{resource="accounts",verb="GET",le="8e+06"} 11543
apiserver_request_latencies_bucket{resource="accounts",verb="GET",le="+Inf"} 11543
apiserver_request_latencies_sum{resource="accounts",verb="GET"} 1.7285094e+07
apiserver_request_latencies_count{resource="accounts",verb="GET"} 11543
apiserver_request_latencies_bucket{resource="accounts",verb="LIST",le="125000"} 566
apiserver_request_latencies_bucket{resource="accounts",verb="LIST",le="250000"} 567
apiserver_request_latencies_bucket{resource="accounts",verb="LIST",le="500000"} 567
apiserver_request_latencies_bucket{resource="accounts",verb="LIST",le="1e+06"} 567
apiserver_request_latencies_bucket{resource="accounts",verb="LIST",le="2e+06"} 567
apiserver_request_latencies_bucket{resource="accounts",verb="LIST",le="4e+06"} 567
apiserver_request_latencies_bucket{resource="accounts",verb="LIST",le="8e+06"} 567
apiserver_request_latencies_bucket{resource="accounts",verb="LIST",le="+Inf"} 567
apiserver_request_latencies_sum{resource="accounts",verb="LIST"} 4.695524e+06
apiserver_request_latencies_count{resource="accounts",verb="LIST"} 567

This metric family can be broken down as:
metric_name: apiserver_request_latencies
metric_type: histogram

It has two different label combinations:
{resource="accounts",verb="LIST"} and {resource="accounts",verb="GET"}

Hence two events ought to be created, one per label combination. An event would look similar to:

      "apiserver_request_latencies": {
        "buckets": {
          "+Inf": 3696,
          "1000000": 0,
          "125000": 0,
          "2000000": 0,
          "250000": 0,
          "4000000": 0,
          "500000": 0,
          "8000000": 0
        },
        "count": 3696,
        "sum": 1668259775362.000000
      },
      "labels": {
        "resource": "roles",
        "verb": "WATCHLIST"
      }
    }
  }

This allows the values of histograms and summaries to be viewed side by side in a single event.
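As a rough illustration of the grouping described above, the following Go sketch folds the samples of one histogram family into one event per label combination. It is not the code from this PR: the `Sample` and `Event` types, `labelKey`, and `GroupHistogram` are hypothetical names, and the input is assumed to be already parsed from the exposition format.

```go
package main

import (
	"sort"
	"strings"
)

// Sample is a hypothetical, simplified stand-in for one exposition line.
type Sample struct {
	Name   string            // e.g. "apiserver_request_latencies_bucket"
	Labels map[string]string // bucket lines also carry the "le" label
	Value  float64
}

// Event is a hypothetical output document for one label combination.
type Event struct {
	Labels  map[string]string
	Buckets map[string]float64
	Sum     float64
	Count   float64
}

// labelKey builds a deterministic key from the labels, ignoring "le"
// so that all buckets of one label combination share the same key.
func labelKey(labels map[string]string) string {
	parts := make([]string, 0, len(labels))
	for k, v := range labels {
		if k == "le" {
			continue
		}
		parts = append(parts, k+"="+v)
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

// GroupHistogram folds the samples of a single histogram family into
// one Event per label combination.
func GroupHistogram(family string, samples []Sample) map[string]*Event {
	events := map[string]*Event{}
	for _, s := range samples {
		key := labelKey(s.Labels)
		ev, ok := events[key]
		if !ok {
			ev = &Event{Labels: map[string]string{}, Buckets: map[string]float64{}}
			for k, v := range s.Labels {
				if k != "le" {
					ev.Labels[k] = v
				}
			}
			events[key] = ev
		}
		switch s.Name {
		case family + "_bucket":
			ev.Buckets[s.Labels["le"]] = s.Value
		case family + "_sum":
			ev.Sum = s.Value
		case family + "_count":
			ev.Count = s.Value
		}
	}
	return events
}
```

Fed with the `apiserver_request_latencies` samples shown earlier, this would yield two events, one for `verb="GET"` and one for `verb="LIST"`, each holding its own buckets, sum, and count.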

@vjsamuel vjsamuel force-pushed the vjsamuel:prometheus_refactor branch from 7a9e535 to 974a687 Apr 21, 2017

@elasticmachine (Collaborator) commented Apr 21, 2017

Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run.

@ruflin (Collaborator) left a comment

This is a really nice addition. It will greatly improve the collector metricset and create far fewer events. By using the Prometheus client we also get some assurance that it keeps working.

@@ -78,6 +78,8 @@ https://github.com/elastic/beats/compare/v5.1.1...master[Check the HEAD diff]
- Make system process metricset honor the cpu_ticks config option. {issue}3590[3590]
- Support common.Time in mapstriface.toTime() {pull}3812[3812]
- Fixing panic on prometheus collector when label has , {pull}3947[3947]
- Fixing prometheus collector to aggregate metrics based on metric family. {pull}4075[4075]

@ruflin (Collaborator), Apr 21, 2017:

I would not put this under bugfixes but breaking changes as it changes the data structure.

@@ -6,6 +6,9 @@ import (
"github.com/elastic/beats/metricbeat/helper"
"github.com/elastic/beats/metricbeat/mb"
"github.com/elastic/beats/metricbeat/mb/parse"

"fmt"

@ruflin (Collaborator), Apr 21, 2017:

Could you move the fmt import to the top so the standard imports stay together?

}

eventList[promEvent.labelHash][promEvent.key] = promEvent.value

@ruflin (Collaborator), Apr 21, 2017:

I assume the promEvent.key values are fairly constant over time, so we won't get a field explosion here.

@vjsamuel (Author, Contributor), Apr 21, 2017:

+1

labels: common.MapStr{
"handler": "query",
"quantile": 0.99,
key: "http_request_duration_microseconds",

@ruflin (Collaborator), Apr 21, 2017:

In a future step we could also extract the "units" part like microseconds as I assume this is also part of the convention.
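A minimal sketch of what such a unit extraction might look like is shown below. It is purely illustrative and not part of this PR: `knownUnits` and `SplitUnit` are hypothetical names, and the list of unit suffixes is an assumed, non-exhaustive sample.

```go
package main

import "strings"

// knownUnits is an assumed, non-exhaustive list of unit suffixes.
var knownUnits = []string{"microseconds", "milliseconds", "seconds", "bytes"}

// SplitUnit returns the metric name without its unit suffix, plus the
// unit itself. If no known unit suffix is found, the name is returned
// unchanged and the unit is empty.
func SplitUnit(name string) (base, unit string) {
	for _, u := range knownUnits {
		suffix := "_" + u
		if strings.HasSuffix(name, suffix) {
			return strings.TrimSuffix(name, suffix), u
		}
	}
	return name, ""
}
```

With this sketch, `SplitUnit("http_request_duration_microseconds")` would return `http_request_duration` and `microseconds`.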

"github.com/elastic/beats/libbeat/common"
dto "github.com/prometheus/client_model/go"
"math"

@ruflin (Collaborator), Apr 21, 2017:

can you move these 2 to the top?

@ruflin (Collaborator), Apr 21, 2017:

Our general logic for imports is:

1. standard imports
2. beats imports
3. external imports

for _, metric := range metrics {
event := PromEvent{
key: name,
labelHash: "#",

@ruflin (Collaborator), Apr 21, 2017:

How exactly does that work?

@vjsamuel (Author, Contributor), Apr 21, 2017:

This makes sure that all metrics that don't have any label values get grouped into a single document. It's a carry-over from the implementation that was in place before this change.
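The idea in this answer can be sketched as follows. This is an assumption-laden illustration, not the PR's code: `LabelHash` and `AddMetric` are hypothetical helpers, and the `key=value` format used for labelled metrics is invented for the example. Only the `"#"` fallback and the `eventList[labelHash][key] = value` shape come from the diff above.

```go
package main

import (
	"sort"
	"strings"
)

// noLabelsHash is the shared key for every metric without labels,
// mirroring the labelHash: "#" default in the diff.
const noLabelsHash = "#"

// LabelHash returns a deterministic key for a label set. The
// "k=v,k=v" format is an assumption made for this sketch.
func LabelHash(labels map[string]string) string {
	if len(labels) == 0 {
		return noLabelsHash
	}
	parts := make([]string, 0, len(labels))
	for k, v := range labels {
		parts = append(parts, k+"="+v)
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

// AddMetric stores value under eventList[labelHash][key], creating
// the inner map for a label set the first time it is seen.
func AddMetric(eventList map[string]map[string]float64, labels map[string]string, key string, value float64) {
	hash := LabelHash(labels)
	if eventList[hash] == nil {
		eventList[hash] = map[string]float64{}
	}
	eventList[hash][key] = value
}
```

Because every unlabelled metric maps to the same `"#"` key, all of them end up as fields of one document rather than one document each.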

key := strconv.FormatFloat((100 * quantile.GetQuantile()), 'f', -1, 64)

if math.IsNaN(quantile.GetValue()) == false {
percentileMap["p"+key] = quantile.GetValue()

@ruflin (Collaborator), Apr 21, 2017:

As it is already under percentile, I don't think we need to add a p in front of the key.

@vjsamuel (Author, Contributor), Apr 21, 2017:

done

}

if len(percentileMap) != 0 {
value["percentiles"] = percentileMap

@ruflin (Collaborator), Apr 21, 2017:

Let's make it percentile, as it then reads percentile.99: 0.2

@vjsamuel (Author, Contributor), Apr 21, 2017:

done
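Putting the two review suggestions together, the percentile handling might look roughly like the sketch below. It follows the `strconv.FormatFloat` call and the NaN check from the diff, but the `Quantile` type and `PercentileMap` function are hypothetical stand-ins for the Prometheus client model types, which are not reproduced here.

```go
package main

import (
	"math"
	"strconv"
)

// Quantile is a hypothetical stand-in for one summary quantile.
type Quantile struct {
	Quantile float64 // e.g. 0.99
	Value    float64
}

// PercentileMap converts summary quantiles into a map keyed by the
// percentile without a "p" prefix (0.99 becomes "99"). NaN values
// are skipped, as in the diff.
func PercentileMap(quantiles []Quantile) map[string]float64 {
	out := map[string]float64{}
	for _, q := range quantiles {
		if math.IsNaN(q.Value) {
			continue
		}
		key := strconv.FormatFloat(100*q.Quantile, 'f', -1, 64)
		out[key] = q.Value
	}
	return out
}
```

Stored under a `percentile` field, an entry for the 0.99 quantile would then read `percentile.99`, as suggested in the review.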

bucketMap[key] = bucket.GetCumulativeCount()
}

value["buckets"] = bucketMap

@ruflin (Collaborator), Apr 21, 2017:

Same here, let's make it bucket.

@vjsamuel (Author, Contributor), Apr 21, 2017:

done
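For the histogram side, a comparable sketch is given below. The `Bucket` type and `BucketMap` function are hypothetical, and formatting the upper bound with `strconv.FormatFloat` is an assumption about how the keys are derived, chosen because it reproduces keys such as `"1000000"` and `"+Inf"` from the example event in the PR description.

```go
package main

import "strconv"

// Bucket is a hypothetical stand-in for one histogram bucket.
type Bucket struct {
	UpperBound      float64
	CumulativeCount uint64
}

// BucketMap converts histogram buckets into a map keyed by the
// upper bound in plain decimal form, so 1e+06 becomes "1000000"
// and positive infinity becomes "+Inf".
func BucketMap(buckets []Bucket) map[string]uint64 {
	out := make(map[string]uint64, len(buckets))
	for _, b := range buckets {
		key := strconv.FormatFloat(b.UpperBound, 'f', -1, 64)
		out[key] = b.CumulativeCount
	}
	return out
}
```

The resulting map would be stored under the singular `bucket` field agreed on above.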


import (
"fmt"
dto "github.com/prometheus/client_model/go"

@ruflin (Collaborator), Apr 21, 2017:

see import rules above

@vjsamuel (Author, Contributor), Apr 21, 2017:

done.

@@ -13,11 +13,13 @@
},
"prometheus": {
"collector": {
"label": {
"labels": {

@ruflin (Collaborator), Apr 21, 2017:

I would prefer to have label here, even though I'm aware this is not consistent with most of the other label fields we have. They should also be singular.

@vjsamuel (Author, Contributor), Apr 21, 2017:

done

@vjsamuel vjsamuel force-pushed the vjsamuel:prometheus_refactor branch from 52b9a68 to f45e386 Apr 21, 2017

@vjsamuel vjsamuel force-pushed the vjsamuel:prometheus_refactor branch from f45e386 to a5ee39f Apr 21, 2017

@ruflin (Collaborator) approved these changes Apr 21, 2017

LGTM. @tsg could you also have a look at this one as I remember we had quite a few discussions about the prometheus format ...

@@ -79,6 +80,7 @@ https://github.com/elastic/beats/compare/v5.1.1...master[Check the HEAD diff]
- Support common.Time in mapstriface.toTime() {pull}3812[3812]
- Fixing panic on prometheus collector when label has , {pull}3947[3947]
- Fix MongoDB dbstats fields mapping. {pull}4025[4025]
- Fixing prometheus collector to aggregate metrics based on metric family. {pull}4075[4075]

@ruflin (Collaborator), Apr 21, 2017:

This one should now be removed.

@tsg (Collaborator) commented Apr 21, 2017

@ruflin @vjsamuel The new document organization makes a lot of sense to me, and it's great we can group more metrics together. 👍 👍 👍

@ruflin ruflin merged commit aad3dbb into elastic:master Apr 21, 2017

5 checks passed

CLA: Commit author has signed the CLA
codecov/patch: 75% of diff hit (target 64.46%)
codecov/project: 64.71% (+0.24%) compared to fb92dee
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed

@vjsamuel vjsamuel deleted the vjsamuel:prometheus_refactor branch Apr 21, 2017

athom added a commit to athom/beats that referenced this pull request Jan 25, 2018