
Histogram field type is not compatible with the prometheus histograms #99820

Closed
tetianakravchenko opened this issue Sep 22, 2023 · 9 comments · Fixed by #99912
Labels
>enhancement · :StorageEngine/TSDB (You know, for Metrics) · Team:Analytics (Meta label for analytical engine team (ESQL/Aggs/Geo))

Comments

@tetianakravchenko

Prometheus histograms are transformed into Elasticsearch histograms using PromHistogramToES.

The counts array is composed by undoing the counter accumulation for each bucket.
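
For illustration, here is a minimal Java sketch of that de-accumulation step (the actual converter is the Go function PromHistogramToES in beats; the class and method names here are hypothetical):

// Minimal sketch: turn Prometheus cumulative bucket counts into
// per-bucket deltas, which is conceptually what the beats converter does.
class HistogramDeltas {
    static long[] deaccumulate(long[] cumulative) {
        long[] deltas = new long[cumulative.length];
        long prev = 0;
        for (int i = 0; i < cumulative.length; i++) {
            deltas[i] = cumulative[i] - prev; // a single delta can exceed Integer.MAX_VALUE
            prev = cumulative[i];
        }
        return deltas;
    }
}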

Even so, the difference between two long values may not fit into the int type that Elasticsearch uses for histogram counts, which causes errors like:

{"type":"x_content_parse_exception","reason":"[1:2070] Numeric value (59997430461) out of range of int (-2147483648 - 2147483647)\n

and the document will be dropped.
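
For scale, a trivial Java check of the failing value against the int range:

long firstCount = 59_997_430_461L;                  // first element of counts
System.out.println(firstCount > Integer.MAX_VALUE); // true: 59997430461 > 2147483647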

Full message:

Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Date(2023, time.September, 22, 12, 19, 26, 601000000, time.Local), Meta:{"input_id":"prometheus/metrics-prometheus-635762b7-7446-4dbd-8b2e-0219e661c290","raw_index":"metrics-prometheus.remote_write-default","stream_id":"prometheus/metrics-prometheus.remote_write-635762b7-7446-4dbd-8b2e-0219e661c290"}, Fields:{"agent":{"ephemeral_id":"839cd17d-903e-4b98-b2d9-24f85a96705a","id":"24474773-c1c5-4dc7-ba53-bb66e56726c6","name":"gke-tetianakravchenko-pr-default-pool-0e95df92-btpw","type":"metricbeat","version":"8.9.2"},"cloud":{"account":{"id":"elastic-observability"},"availability_zone":"us-central1-c","instance":{"id":"7106047343117859434","name":"gke-tetianakravchenko-pr-default-pool-0e95df92-btpw"},"machine":{"type":"e2-medium"},"project":{"id":"elastic-observability"},"provider":"gcp","region":"us-central1","service":{"name":"GCE"}},"data_stream":{"dataset":"prometheus.remote_write","namespace":"default","type":"metrics"},"ecs":{"version":"8.0.0"},"elastic_agent":{"id":"24474773-c1c5-4dc7-ba53-bb66e56726c6","snapshot":false,"version":"8.9.2"},"event":{"dataset":"prometheus.remote_write","module":"prometheus"},"host":{"architecture":"x86_64","containerized":false,"hostname":"gke-tetianakravchenko-pr-default-pool-0e95df92-btpw","id":"7cf9ef1a908c8254d6aebf6c7bba6827","ip":["10.128.0.28","fe80::4001:aff:fe80:1c","169.254.123.1","10.104.2.1","fe80::4c1e:a0ff:fe4b:cd4e","10.104.2.1","fe80::982f:49ff:fe2d:c8f","10.104.2.1","fe80::4ef:e7ff:fe90:222e","10.104.2.1","fe80::1003:60ff:feee:1e87","10.104.2.1","fe80::907e:adff:fe38:8d5c","10.104.2.1","fe80::1451:b3ff:fee9:eb00","10.104.2.1","fe80::648c:17ff:fe38:988"],"mac":["02-42-30-42-7A-BB","06-EF-E7-90-22-2E","12-03-60-EE-1E-87","16-51-B3-E9-EB-00","42-01-0A-80-00-1C","4E-1E-A0-4B-CD-4E","66-8C-17-38-09-88","92-7E-AD-38-8D-5C","9A-2F-49-2D-0C-8F"],"name":"gke-tetianakravchenko-pr-default-pool-0e95df92-btpw","os":{"codename":"focal","family":"debian","kernel":"5.15.109+","name":"Ubuntu","platform":"ubuntu","type":"linux","version":"20.04.6 LTS (Focal Fossa)"}},"metricset":{"name":"remote_write"},"orchestrator":{"cluster":{"name":"tetianakravchenko-prometheus-test","url":"https://10.128.0.18"}},"prometheus":{"apiserver_flowcontrol_priority_level_request_utilization":{"histogram":{"counts":[59997430461,2726772,0,0,0,0,0,0,0,0,0],"values":[0,0.0005,0.002,0.006500000000000001,0.019999999999999997,0.065,0.175,0.375,0.625,0.875,1.25]}},"apiserver_flowcontrol_priority_level_request_utilization_count":{"counter":257267325323388},"apiserver_flowcontrol_priority_level_request_utilization_sum":{"counter":3146017.521250039},"labels":{"instance":"10.128.0.18:443","job":"kubernetes-apiservers","phase":"waiting","priority_level":"workload-high"}},"service":{"type":"prometheus"}}, Private:interface {}(nil), TimeSeries:true}, Flags:0x0, Cache:publisher.EventCache{m:mapstr.M(nil)}} (status=400): {"type":"document_parsing_exception","reason":"[1:2059] failed to parse field [prometheus.apiserver_flowcontrol_priority_level_request_utilization.histogram] of type [histogram]","caused_by":{"type":"x_content_parse_exception","reason":"[1:2070] Numeric value (59997430461) out of range of int (-2147483648 - 2147483647)\n at [Source: 
(byte[])\"{\"agent\":{\"name\":\"gke-tetianakravchenko-pr-default-pool-0e95df92-btpw\",\"id\":\"24474773-c1c5-4dc7-ba53-bb66e56726c6\",\"type\":\"metricbeat\",\"ephemeral_id\":\"839cd17d-903e-4b98-b2d9-24f85a96705a\",\"version\":\"8.9.2\"},\"elastic_agent\":{\"id\":\"24474773-c1c5-4dc7-ba53-bb66e56726c6\",\"version\":\"8.9.2\",\"snapshot\":false},\"cloud\":{\"availability_zone\":\"us-central1-c\",\"instance\":{\"name\":\"gke-tetianakravchenko-pr-default-pool-0e95df92-btpw\",\"id\":\"7106047343117859434\"},\"provider\":\"gcp\",\"service\":{\"name\":\"GCE\"},\"machin\"[truncated 2100 bytes]; line: 1, column: 2070]","caused_by":{"type":"input_coercion_exception","reason":"Numeric value (59997430461) out of range of int (-2147483648 - 2147483647)\n at [Source: (byte[])\"{\"agent\":{\"name\":\"gke-tetianakravchenko-pr-default-pool-0e95df92-btpw\",\"id\":\"24474773-c1c5-4dc7-ba53-bb66e56726c6\",\"type\":\"metricbeat\",\"ephemeral_id\":\"839cd17d-903e-4b98-b2d9-24f85a96705a\",\"version\":\"8.9.2\"},\"elastic_agent\":{\"id\":\"24474773-c1c5-4dc7-ba53-bb66e56726c6\",\"version\":\"8.9.2\",\"snapshot\":false},\"cloud\":{\"availability_zone\":\"us-central1-c\",\"instance\":{\"name\":\"gke-tetianakravchenko-pr-default-pool-0e95df92-btpw\",\"id\":\"7106047343117859434\"},\"provider\":\"gcp\",\"service\":{\"name\":\"GCE\"},\"machin\"[truncated 2100 bytes]; line: 1, column: 2070]"}}}, dropping event!
"prometheus":{
    "apiserver_flowcontrol_priority_level_request_utilization":{
        "histogram":{
            "counts":[
                59997430461,
                2726772,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0
            ],
            "values":[
                0,
                0.0005,
                0.002,
                0.006500000000000001,
                0.019999999999999997,
                0.065,
                0.175,
                0.375,
                0.625,
                0.875,
                1.25
            ]
        }
    }
}

The problem is with the first value (I've checked other dropped events and they show the same behavior).

@elasticsearchmachine (Collaborator)

Pinging @elastic/es-analytics-geo (Team:Analytics)

@salvatore-campagna (Contributor)

As you can see from the example, the first bucket holds the count of values less than or equal to 0. If I understand correctly the way you decided to handle negative bucket values, you should accumulate the counts from the first two buckets so as to have

le=0.005, counts= 59997430461 + 2726772

Anyway, this shouldn't overflow, because our histogram type supports double values and long counts.

@tetianakravchenko (Author)

If I understand correctly the way you decided to handle negative bucket values, you should accumulate the counts from the first two buckets so as to have

Not sure why you are referring to negative bucket value handling; here is a Prometheus histogram metric:

apiserver_flowcontrol_priority_level_request_utilization_bucket{phase="waiting",priority_level="workload-low",le="0"} 5.02948066088585e+14
apiserver_flowcontrol_priority_level_request_utilization_bucket{phase="waiting",priority_level="workload-low",le="0.001"} 5.02961865361956e+14
apiserver_flowcontrol_priority_level_request_utilization_bucket{phase="waiting",priority_level="workload-low",le="0.003"} 5.02962348792884e+14
apiserver_flowcontrol_priority_level_request_utilization_bucket{phase="waiting",priority_level="workload-low",le="0.01"} 5.02962348792884e+14
apiserver_flowcontrol_priority_level_request_utilization_bucket{phase="waiting",priority_level="workload-low",le="0.03"} 5.02962348792884e+14
apiserver_flowcontrol_priority_level_request_utilization_bucket{phase="waiting",priority_level="workload-low",le="0.1"} 5.02962348792884e+14
apiserver_flowcontrol_priority_level_request_utilization_bucket{phase="waiting",priority_level="workload-low",le="0.25"} 5.02962348792884e+14
apiserver_flowcontrol_priority_level_request_utilization_bucket{phase="waiting",priority_level="workload-low",le="0.5"} 5.02962348792884e+14
apiserver_flowcontrol_priority_level_request_utilization_bucket{phase="waiting",priority_level="workload-low",le="0.75"} 5.02962348792884e+14
apiserver_flowcontrol_priority_level_request_utilization_bucket{phase="waiting",priority_level="workload-low",le="1"} 5.02962348792884e+14
apiserver_flowcontrol_priority_level_request_utilization_bucket{phase="waiting",priority_level="workload-low",le="+Inf"} 5.02962348792884e+14
apiserver_flowcontrol_priority_level_request_utilization_sum{phase="waiting",priority_level="workload-low"} 3.3128167345312536e+06
apiserver_flowcontrol_priority_level_request_utilization_count{phase="waiting",priority_level="workload-low"} 5.02962348792884e+14

There are no negative buckets.

you should accumulate the counts from the first two buckets so as to have
le=0.005, counts= 59997430461 + 2726772

Sorry, I didn't understand this. Each element of counts is a de-accumulated value; here is a test file with some explanation of how the counts are calculated: https://github.com/elastic/beats/blob/243e6c0f3450282bb73067a98a30d5c718274803/x-pack/metricbeat/module/prometheus/collector/histogram_test.go#L392-L447
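
For example, de-accumulating the first three cumulative buckets of the scrape above (my own worked illustration, in Java):

// Cumulative counts for le="0", le="0.001", le="0.003" from the metric above.
long[] cumulative = { 502948066088585L, 502961865361956L, 502962348792884L };
long prev = 0;
for (long c : cumulative) {
    System.out.println(c - prev); // 502948066088585, 13799273371, 483430928
    prev = c;
}
// The first two deltas alone exceed Integer.MAX_VALUE (2147483647).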

Anyway, this shouldn't overflow, because our histogram type supports double values and long counts.

So the documentation is not correct when it says "A corresponding counts array of [integer](https://www.elastic.co/guide/en/elasticsearch/reference/current/number.html) numbers, representing how many values fall into each bucket"?

@kkrik-es (Contributor)

kkrik-es added a commit to kkrik-es/elasticsearch that referenced this issue Sep 26, 2023
Histograms currently use integers to store the count of each value,
which can overflow. Switch to using long integers to avoid this.

TDigestState was updated to use long for centroid value count in elastic#99491

Fixes elastic#99820
@tetianakravchenko (Author)

Hey @kkrik-es

Is there any reason not to use the unsigned_long type for the counts? The type used for the Prometheus histogram is uint64, and as I understand it, counts must be positive or zero anyway.

@kkrik-es (Contributor)

I believe we already check programmatically that count values are not negative.

@salvatore-campagna (Contributor)

Hey @kkrik-es

Is there any reason not to use the unsigned_long type for the counts? The type used for the Prometheus histogram is uint64, and as I understand it, counts must be positive or zero anyway.

Elasticsearch is written in Java...and unfortunately Java does not have unsigned types (unless we do something special to treat signed numbers as unsigned).

@tetianakravchenko (Author)

tetianakravchenko commented Sep 27, 2023

Hey @kkrik-es

Is there any reason not to use the unsigned_long type for the counts? The type used for the Prometheus histogram is uint64, and as I understand it, counts must be positive or zero anyway.

Elasticsearch is written in Java...and unfortunately Java does not have unsigned types (unless we do something special to treat signed numbers as unsigned).

I was referring to this page: https://www.elastic.co/guide/en/elasticsearch/reference/current/number.html#number. As I see, unsigned_long is listed among the supported numeric field types, so from my understanding something must already be implemented to treat signed numbers as unsigned. Anyway, it is more of a curiosity question and an attempt to align with the Prometheus histogram type.
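
For reference, this is what Java's standard-library "treat signed as unsigned" helpers look like (an illustration of the workaround, not a claim about how the unsigned_long field type is implemented internally):

long raw = -1L; // same 64-bit pattern as the uint64 value 18446744073709551615
System.out.println(Long.toUnsignedString(raw));        // 18446744073709551615
System.out.println(Long.compareUnsigned(raw, 1L) > 0); // true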

kkrik-es added a commit that referenced this issue Sep 29, 2023
* Represent histogram value count as long

Histograms currently use integers to store the count of each value,
which can overflow. Switch to using long integers to avoid this.

TDigestState was updated to use long for centroid value count in #99491

Fixes #99820

* Update docs/changelog/99912.yaml

* spotless fix
@kkrik-es (Contributor)

kkrik-es commented Sep 29, 2023

Kindly note that the submitted fix merely addresses the shortcoming of ES histograms with regard to supporting large count values per inserted point. Other incompatibilities with Prom histograms, e.g. around the use of cumulative values and different internal implementation for percentiles, persist.

piergm pushed a commit to piergm/elasticsearch that referenced this issue Oct 2, 2023

jakelandis pushed a commit to jakelandis/elasticsearch that referenced this issue Oct 2, 2023