Metrics reporting unusually high number of written records #7218

Closed
korthout opened this issue Jun 8, 2021 · 3 comments · Fixed by #8323
Assignees: Zelldon
Labels:
  • kind/bug (Categorizes an issue or PR as a bug)
  • scope/broker (Marks an issue or PR to appear in the broker section of the changelog)
  • version:1.3.0 (Marks an issue as being completely or in parts released in 1.3.0)

Comments


korthout commented Jun 8, 2021

Describe the bug

The Grafana dashboard is showing an unusually high number of written records (~3M per minute compared to the usual ~3K). See screenshots.

This seems to coincide with the restart of broker zeebe-2, which at the time of the restart was leader for partition 2. Partitions 1 and 3 were led by zeebe-0, which took over leadership of partition 2 after zeebe-2 went down.

Screenshots

Larger timeframe

Shows high peaks in the "processing per partition" panel. Note that "written" is normally about 3K per minute per partition.
[Screenshot: Grafana dashboard, 2021-06-08 13:35:53]

Focused timeframe

Shows unusually high "written" numbers in the "processing per partition" panel.
[Screenshot: Grafana dashboard, 2021-06-08 13:36:13]

To Reproduce

Unknown.

I've tried to reproduce it on the same benchmark by restarting pods individually, but without success.

Expected behavior

Failover should not lead to a report of additional written records.

Environment:

  • Context: gke_zeebe-io_europe-west1-b_zeebe-cluster
  • Namespace: medic-cw-23-f01b5bc83-benchmark
  • Zeebe Version: f01b5bc
  • Configuration: benchmark
korthout added the kind/bug, scope/broker, and Impact: Observability labels on Jun 8, 2021

korthout commented Jun 11, 2021

Also seen in medic-cw-22-ce77f891b-benchmark on 2021-06-10 at 15:33:00 CEST (see the Grafana dashboard).

Note: there is a leader change happening at that time. To make it visible in the dashboard, edit the role changes panel and set the min interval of the Prometheus query options to 1s.

ghost pushed a commit that referenced this issue Jun 14, 2021
7255: Increase role change granularity in Grafana r=korthout a=korthout

## Description


In some cases, the role change happens so fast that it reverts to the
previous state in less than 15s. When that happens, Grafana is unable
to show the election cycle. By reducing the query interval for the
datasource, we can increase the granularity and show these role changes.
The interval is only changed for the role change panel, to avoid
impacting performance.

## Related issues


relates to #7218



Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
Zelldon self-assigned this on Dec 7, 2021
Zelldon commented Dec 7, 2021

The issue is that we calculate the amount here: https://github.com/camunda-cloud/zeebe/blob/develop/engine/src/main/java/io/camunda/zeebe/engine/processing/streamprocessor/ProcessingStateMachine.java#L368

But the lastWrittenEventPosition can be zero on fail-over, which means the computed amount is very high, since the writtenEventPosition is quite high.

Ideally we would set the lastWrittenEventPosition on fail-over, maybe to the last processed position (which we get from the state) or something similar.
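
To illustrate the diagnosis above, here is a minimal, runnable sketch of the failure mode and the proposed initialization. All names in it (WrittenRecordsMetricSketch, onEventWritten, init) are hypothetical and do not appear in the actual Zeebe code:

```java
// Minimal sketch of the suspected failure mode; names are illustrative,
// not the actual Zeebe implementation.
public class WrittenRecordsMetricSketch {

  private long lastWrittenEventPosition; // stays 0 after a fail-over

  /** Proposed fix: seed the position on start-up, e.g. from the last processed position in the state. */
  void init(final long lastProcessedPosition) {
    lastWrittenEventPosition = lastProcessedPosition;
  }

  /** Returns the amount that would be reported to the "written records" metric. */
  long onEventWritten(final long writtenEventPosition) {
    final long amount = writtenEventPosition - lastWrittenEventPosition;
    lastWrittenEventPosition = writtenEventPosition;
    return amount;
  }

  public static void main(final String[] args) {
    // Uninitialized position after fail-over: the first write reports a gap
    // spanning the whole log, matching the ~3M spike seen in the dashboard.
    final var broken = new WrittenRecordsMetricSketch();
    System.out.println(broken.onEventWritten(3_000_000L)); // prints 3000000

    // With the position seeded from the state, the spike disappears.
    final var fixed = new WrittenRecordsMetricSketch();
    fixed.init(2_999_999L);
    System.out.println(fixed.onEventWritten(3_000_000L)); // prints 1
  }
}
```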

ghost pushed a commit that referenced this issue Dec 8, 2021
8323: Reset last written position r=Zelldon a=Zelldon

## Description

Previously, the last written position was not reset when starting the processing state machine, which could cause weird metric exports because the gap between the lastWritten and newWritten positions was too high (over a million). The replay state machine now returns all positions and initializes the processing state machine with them.

## Related issues


closes #7218 



Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
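
A rough sketch of the shape of that hand-off; the names below (ReplayHandoffSketch, LastProcessingPositions, startProcessing) are illustrative assumptions, not the actual diff from #8323:

```java
// Illustrative sketch only: the replay state machine hands the recovered
// positions to the processing state machine instead of leaving them at 0.
public class ReplayHandoffSketch {

  /** Positions recovered during replay (hypothetical record). */
  record LastProcessingPositions(long lastProcessedPosition, long lastWrittenPosition) {}

  static final class ProcessingStateMachine {
    private long lastWrittenEventPosition;

    void startProcessing(final LastProcessingPositions positions) {
      // Before the fix this field kept its default of 0 after fail-over;
      // now processing resumes from the position recovered during replay.
      lastWrittenEventPosition = positions.lastWrittenPosition();
      System.out.println("resuming from written position " + lastWrittenEventPosition);
    }
  }

  public static void main(final String[] args) {
    // Pretend replay recovered these positions from the log and state.
    final var recovered = new LastProcessingPositions(2_999_999L, 3_000_000L);
    new ProcessingStateMachine().startProcessing(recovered);
  }
}
```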
ghost closed this as completed in 172a2cc on Dec 10, 2021
korthout added the version:1.3.0 label on Jan 4, 2022