Operate import slows down in case of data loss #19424

Closed

sdorokhova opened this issue Jun 17, 2024 · 1 comment · Fixed by #19429 or #19431
Assignees: sdorokhova
Labels: component/operate (Related to the Operate component/team), kind/bug (Categorizes an issue or PR as a bug), Release: 8.3.13, Release: 8.4.10, Release: 8.5.4, support (Marks an issue as related to a customer support request), version:8.2.28 (Label that represents issues released on version 8.2.28)

Comments

sdorokhova (Contributor) commented Jun 17, 2024

Describe the bug

When searching for flow node instance parents, we retry twice with a 2-second delay to cover the case where the parent flow node instance was imported with the previous batch but the Elasticsearch refresh has not happened yet. This 2-second sleep becomes a problem under high data load when the piece of data containing the parents has been lost: for every imported child we then wait 2 seconds while searching for its parent, which makes the import extremely slow.

We should avoid blocking the import for 2 seconds.
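
For illustration, a minimal sketch of the blocking lookup pattern described above, assuming a retry-twice-with-sleep loop; the class, method, and field names are invented and this is not Operate's actual code. The point is that a lost parent record makes every child instance in the batch pay the full sleep budget.

```java
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch of a blocking parent lookup that retries with a sleep. */
final class BlockingParentLookup {

  private static final int MAX_ATTEMPTS = 2;   // "we retry twice"
  private static final long SLEEP_MS = 2_000L; // "with a 2-second delay"

  // Stand-in for the Elasticsearch/OpenSearch query that resolves the parent's tree path.
  private final Function<String, Optional<String>> searchParentTreePath;

  BlockingParentLookup(Function<String, Optional<String>> searchParentTreePath) {
    this.searchParentTreePath = searchParentTreePath;
  }

  Optional<String> find(String parentFlowNodeInstanceId) throws InterruptedException {
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      final Optional<String> treePath = searchParentTreePath.apply(parentFlowNodeInstanceId);
      if (treePath.isPresent()) {
        return treePath;
      }
      if (attempt < MAX_ATTEMPTS) {
        // If the parent record was lost, this sleep is paid for every single imported child.
        Thread.sleep(SLEEP_MS);
      }
    }
    return Optional.empty(); // parent is reported as missing and the import continues
  }
}
```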

To Reproduce

Steps to reproduce the behavior:

  1. Deploy a process with a subprocess or a multi-instance flow node.
  2. Start the process in Zeebe and wait until a flow node inside the subprocess or multi-instance is activated.
  3. Remove the Zeebe records for the parent flow node (e.g. the subprocess).
  4. Start Operate in debug mode and observe how the import behaves.

Current behavior

The import waits for 2 seconds before reporting the missing parent and continuing.

Expected behavior

No waiting.

Environment

  • Operate Version: 8.2.22.

Additional context

Related support case: https://jira.camunda.com/browse/SUPPORT-22204


Acceptance Criteria

Definition of Ready - Checklist

  • The bug has been reproduced by the assignee in an environment compatible with the provided one; otherwise, the issue is closed with a comment
  • The issue has a meaningful title, description, and testable acceptance criteria
  • The issue has been labeled with an appropriate Bug-area label
  • Necessary screenshots, screen recordings, or files are attached to the bug report

For UI changes required to solve the bug:

  • Design input has been collected by the assignee

Implementation

🔍 Root Cause Analysis

💭 Proposed Solution

👉 Handover Dev to QA

  • Changed components:
  • Side effects on other components:
  • Handy resources:
    BPMN/DMN models, plugins, scripts, REST API endpoints + example payload, etc.:
  • Example projects:
  • Commands/Steps needed to test; versions to validate:
  • Docker file / Helm chart: if it needs to be tested via Docker, share the version containing the fix along with the versions of the other services.
  • Release version (in which version this fix/feature will be released):

📗 Link to the test case

@sdorokhova sdorokhova added the kind/bug and component/operate labels Jun 17, 2024
@sdorokhova sdorokhova changed the title from "Operate import slow down in case of data loss" to "Operate import slows down in case of data loss" Jun 17, 2024
@github-actions github-actions bot added the support label Jun 17, 2024
sdorokhova added a commit that referenced this issue Jun 17, 2024
* introduce a new config parameter `camunda.operate.importer.retryReadingParents` which will prevent retrying with a sleep call when reading parents from Elastic/OpenSearch
* we also have a `sleep` call in the Incident post-processor. The reason: we want to wait for Elastic to refresh shards before processing the next batch. Replaced it with scheduling the next call with a delay. Also increased the backoff period to 5 sec.

Closes #19424
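
A minimal sketch of the non-blocking pattern this commit describes: instead of sleeping on the importer thread, the next attempt is scheduled with a delay. The class, the executor setup, and the method names here are assumptions for illustration, not Operate's actual implementation; only the 5-second backoff and the `camunda.operate.importer.retryReadingParents` property name come from the commit message above.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: reschedule the next run with a delay instead of blocking with sleep(). */
final class NonBlockingPostProcessor {

  private static final long BACKOFF_SECONDS = 5; // "increased backoff period to 5 sec"

  private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

  void processNextBatch() {
    // Stand-in for the Elasticsearch/OpenSearch check that the previous batch is visible yet.
    if (!previousBatchVisible()) {
      // Free the thread and try again later instead of calling Thread.sleep(...).
      scheduler.schedule(this::processNextBatch, BACKOFF_SECONDS, TimeUnit.SECONDS);
      return;
    }
    // ... post-process incidents for the batch ...
  }

  private boolean previousBatchVisible() {
    return true; // placeholder for the real refresh/visibility check
  }
}
```

With `camunda.operate.importer.retryReadingParents` switched off (presumably a boolean importer setting; its exact type and default are not stated in this issue), the parent lookup would no longer retry with a sleep, so a missing parent is reported immediately instead of stalling every child by 2 seconds.
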
sdorokhova added a commit that referenced this issue Jun 17, 2024 (same commit message as above)
sdorokhova added a commit that referenced this issue Jun 17, 2024 (same commit message as above)
sdorokhova added a commit that referenced this issue Jun 17, 2024 (same commit message as above)
mihail-ca (Contributor) commented

/public

@sdorokhova sdorokhova self-assigned this Jun 17, 2024
@sdorokhova sdorokhova linked a pull request Jun 17, 2024 that will close this issue
sdorokhova added a commit that referenced this issue Jun 18, 2024 (same commit message as above)
sdorokhova added a commit that referenced this issue Jun 19, 2024 (same commit message as above)
@sdorokhova sdorokhova added the version:8.2.28 label Jul 1, 2024
sdorokhova added a commit that referenced this issue Jul 1, 2024 (same commit message as above)
sdorokhova added a commit that referenced this issue Jul 8, 2024 (same commit message as above)
renovate bot pushed a commit that referenced this issue Jul 8, 2024 (same commit message as above)
Projects
None yet
2 participants