Metrics reporting unusually high number of written records #7218
Comments
Also seen in

Note: there is a leader change happening at that time. To make it visible in the dashboard, edit the role changes panel and set the min interval of the Prometheus query options to 1s.
7255: Increase role change granularity in Grafana r=korthout a=korthout

## Description

In some cases, the role change happens so fast and changes back to the previous state (in less than 15s). When that happens, Grafana was unable to show the election cycle. By reducing the query interval for the datasource we can increase the granularity and show these role changes. The interval is only changed for the role change panel, to not impact performance.

## Related issues

relates to #7218

Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
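For reference, a minimal sketch of what that query option looks like in Grafana dashboard JSON (panel title, datasource, and query are illustrative placeholders; the point is only the panel-level `interval` field, which corresponds to "Min interval" in the query options):

```json
{
  "title": "Role Changes",
  "type": "graph",
  "datasource": "Prometheus",
  "interval": "1s",
  "targets": [
    { "expr": "<role-change query>" }
  ]
}
```

Keeping the 1s interval scoped to this single panel avoids increasing query load for the rest of the dashboard.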
Looks like the
The issue is that we calculate the amount here: https://github.com/camunda-cloud/zeebe/blob/develop/engine/src/main/java/io/camunda/zeebe/engine/processing/streamprocessor/ProcessingStateMachine.java#L368

But the lastWrittenEventPosition is not reset on fail-over. Ideally we would set the lastWrittenEventPosition on fail-over, maybe to the last processed position (which we get from the state) or something similar.
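A minimal sketch of the failure mode (names and structure are simplified and hypothetical, not the actual ProcessingStateMachine code): the written-records metric is fed from the gap between the last and the newly written position, so a stale lastWrittenEventPosition on the new leader turns the first update into one huge bogus increment.

```java
// Hypothetical simplification, not the actual Zeebe code: the written-records
// metric is derived from a position gap, so a stale lastWrittenEventPosition
// produces one multi-million increment after fail-over.
final class WrittenRecordsCounter {

  private long lastWrittenEventPosition;

  WrittenRecordsCounter(final long initialPosition) {
    // If this is not initialized from a recent position on fail-over, the
    // first gap below spans everything written since the stale value.
    lastWrittenEventPosition = initialPosition;
  }

  /** Called after a batch of records has been written to the log stream. */
  long onWritten(final long newWrittenPosition) {
    final long amount = newWrittenPosition - lastWrittenEventPosition;
    lastWrittenEventPosition = newWrittenPosition;
    return amount; // value added to the "written records" counter
  }
}
```

Initializing lastWrittenEventPosition from the last processed (or last written) position known from the state, as suggested above, keeps that first gap small.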
8323: Reset last written position r=Zelldon a=Zelldon

## Description

Previously the last written position was not reset when starting the processing state machine, which could cause weird metric exports, because the gap between the lastWritten and newWritten position was too high (over a million or more). The replay state machine now returns all positions and initializes the processing state machine with them.

## Related issues

closes #7218

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
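A rough sketch of the direction this fix takes (class and method names are illustrative assumptions, not the actual Zeebe API): replay reports the positions it ended on, and the processing state machine starts from them instead of from stale defaults.

```java
// Illustrative sketch of the fix described above; names are assumptions,
// not the actual Zeebe classes.
record LastProcessingPositions(long lastProcessedPosition, long lastWrittenPosition) {}

final class ReplayStateMachine {

  /** Replays the log into the state and reports the positions it ended on. */
  LastProcessingPositions replay() {
    long lastProcessed = -1L;
    long lastWritten = -1L;
    // ... iterate over the log, apply events, and track both positions ...
    return new LastProcessingPositions(lastProcessed, lastWritten);
  }
}

final class ProcessingStateMachine {

  private long lastProcessedPosition;
  private long lastWrittenEventPosition;

  /** Start processing from the positions replay ended on, not from stale defaults. */
  void startProcessing(final LastProcessingPositions positions) {
    lastProcessedPosition = positions.lastProcessedPosition();
    lastWrittenEventPosition = positions.lastWrittenPosition();
    // Subsequent written-records metric updates now only cover newly written
    // records, so a fail-over no longer shows a multi-million spike.
  }
}
```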
Describe the bug
The Grafana dashboard is showing an unusually high number of written records (~3M compared to ~3K per minute). See screenshots.
This seems to coincide with the restart of broker zeebe-2, which at the time of the restart was leader for partition 2. Partitions 1 and 3 were led by zeebe-0, which took over leadership of partition 2 after zeebe-2 went down.

Screenshots
Larger timeframe
Shows high peaks in the processing per partition panel. Also note that written records are normally about 3K per minute per partition.
Focussed timeframe
Shows unusually high written-record counts in the processing per partition panel.
To Reproduce
Unknown.
I've tried to reproduce it on the same benchmark by restarting pods individually, but without success.
Expected behavior
Failover should not lead to a report of additional written records.
Environment:
- Cluster: gke_zeebe-io_europe-west1-b_zeebe-cluster
- Benchmark: medic-cw-23-f01b5bc83-benchmark