Metrics reporting unusually high number of written records #7218

Closed
korthout opened this issue Jun 8, 2021 · 3 comments · Fixed by #8323
Assignees: Zelldon
Labels:
  • kind/bug (Categorizes an issue or PR as a bug)
  • scope/broker (Marks an issue or PR to appear in the broker section of the changelog)
  • version:1.3.0 (Marks an issue as being completely or in parts released in 1.3.0)

Comments


korthout commented Jun 8, 2021

Describe the bug

The Grafana dashboard is showing an unusually high number of written records (~3M per minute compared to the usual ~3K). See screenshots.

This seems to coincide with the restart of broker zeebe-2, which at the time of the restart was leader for partition 2. Partitions 1 and 3 were led by zeebe-0, which took over leadership of partition 2 after zeebe-2 went down.

Screenshots

Larger timeframe

Shows high peaks in the "processing per partition" panel. Note that "written" is normally about 3K per minute per partition.
[Screenshot: Grafana dashboard, 2021-06-08 13:35:53]

Focused timeframe

Shows unusually high "written" numbers in the "processing per partition" panel.
[Screenshot: Grafana dashboard, 2021-06-08 13:36:13]

To Reproduce

Unknown.

I've tried to reproduce it on the same benchmark by restarting pods individually, but without success.

Expected behavior

Failover should not lead to a report of additional written records.

Environment:

  • Context: gke_zeebe-io_europe-west1-b_zeebe-cluster
  • Namespace: medic-cw-23-f01b5bc83-benchmark
  • Zeebe Version: f01b5bc
  • Configuration: benchmark
korthout added the kind/bug, scope/broker, and Impact: Observability labels on Jun 8, 2021

korthout commented Jun 11, 2021

Also seen in medic-cw-22-ce77f891b-benchmark on 2021-06-10 at 15:33:00 CEST (see the Grafana dashboard).

Note: there is a leader change happening at that time. To make it visible in the dashboard, edit the role changes panel and set the min interval of the Prometheus query options to 1s.

ghost pushed a commit that referenced this issue Jun 14, 2021
7255: Increase role change granularity in Grafana r=korthout a=korthout

## Description


In some cases, the role change happens so fast that it reverts to the
previous state in less than 15s. When that happens, Grafana is unable
to show the election cycle. By reducing the query interval for the
datasource, we can increase the granularity and show these role changes.
The interval is only changed for the role change panel, to avoid
impacting performance.

## Related issues


relates to #7218



Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
Zelldon self-assigned this on Dec 7, 2021
Zelldon commented Dec 7, 2021

The issue is that we calculate the amount here: https://github.com/camunda-cloud/zeebe/blob/develop/engine/src/main/java/io/camunda/zeebe/engine/processing/streamprocessor/ProcessingStateMachine.java#L368

But the lastWrittenEventPosition can be zero on fail-over, which means the computed amount is very high, since the writtenEventPosition is quite high.

Ideally we would set the lastWrittenEventPosition on fail-over, maybe to the last processed position (which we get from the state) or something similar.
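
To illustrate the diagnosis above, here is a minimal, runnable sketch of the failure mode and the proposed initialization. All names in it (WrittenRecordsMetricSketch, onEventWritten, init) are hypothetical and do not appear in the actual Zeebe code:

```java
// Minimal sketch of the suspected failure mode; names are illustrative,
// not the actual Zeebe implementation.
public class WrittenRecordsMetricSketch {

  private long lastWrittenEventPosition; // stays 0 after a fail-over

  /** Proposed fix: seed the position on start-up, e.g. from the last processed position in the state. */
  void init(final long lastProcessedPosition) {
    lastWrittenEventPosition = lastProcessedPosition;
  }

  /** Returns the amount that would be reported to the "written records" metric. */
  long onEventWritten(final long writtenEventPosition) {
    final long amount = writtenEventPosition - lastWrittenEventPosition;
    lastWrittenEventPosition = writtenEventPosition;
    return amount;
  }

  public static void main(final String[] args) {
    // Uninitialized position after fail-over: the first write reports a gap
    // spanning the whole log, matching the ~3M spike seen in the dashboard.
    final var broken = new WrittenRecordsMetricSketch();
    System.out.println(broken.onEventWritten(3_000_000L)); // prints 3000000

    // With the position seeded from the state, the spike disappears.
    final var fixed = new WrittenRecordsMetricSketch();
    fixed.init(2_999_999L);
    System.out.println(fixed.onEventWritten(3_000_000L)); // prints 1
  }
}
```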

ghost pushed a commit that referenced this issue Dec 8, 2021
8323: Reset last written position r=Zelldon a=Zelldon

## Description

Previously, the last written position was not reset when starting the processing state machine, which could cause weird metric exports because the gap between the lastWritten and newWritten positions was too high (over a million). The replay state machine now returns all positions and initializes the processing state machine with them.

## Related issues


closes #7218 



Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
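
A rough sketch of the shape of that hand-off; the names below (ReplayHandoffSketch, LastProcessingPositions, startProcessing) are illustrative assumptions, not the actual diff from #8323:

```java
// Illustrative sketch only: the replay state machine hands the recovered
// positions to the processing state machine instead of leaving them at 0.
public class ReplayHandoffSketch {

  /** Positions recovered during replay (hypothetical record). */
  record LastProcessingPositions(long lastProcessedPosition, long lastWrittenPosition) {}

  static final class ProcessingStateMachine {
    private long lastWrittenEventPosition;

    void startProcessing(final LastProcessingPositions positions) {
      // Before the fix this field kept its default of 0 after fail-over;
      // now processing resumes from the position recovered during replay.
      lastWrittenEventPosition = positions.lastWrittenPosition();
      System.out.println("resuming from written position " + lastWrittenEventPosition);
    }
  }

  public static void main(final String[] args) {
    // Pretend replay recovered these positions from the log and state.
    final var recovered = new LastProcessingPositions(2_999_999L, 3_000_000L);
    new ProcessingStateMachine().startProcessing(recovered);
  }
}
```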
ghost closed this as completed in 172a2cc on Dec 10, 2021
korthout added the version:1.3.0 label on Jan 4, 2022