Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Requests to activate jobs may result in infinite execution from the Gateway to the brokers #8310

Closed
2 tasks done
romansmirnov opened this issue Dec 2, 2021 · 4 comments
Closed
2 tasks done
Assignees
Labels
kind/bug Categorizes an issue or PR as a bug scope/gateway Marks an issue or PR to appear in the gateway section of the changelog severity/low Marks a bug as having little to no noticeable impact for the user support Marks an issue as related to a customer support request version:1.3.0 Marks an issue as being completely or in parts released in 1.3.0

Comments

@romansmirnov
Copy link
Member

romansmirnov commented Dec 2, 2021

Describe the bug

Whenever the failed attempts (i.e, InFlightLongPollingActivateJobsRequestsState#failedAttempts) are reset, all currently active requests are added to the queue of activeRequestsToBeRepeated:

https://github.com/camunda-cloud/zeebe/blob/12eeb3f81af0c1b65d2257cab52b71a712562757/gateway/src/main/java/io/camunda/zeebe/gateway/impl/job/InFlightLongPollingActivateJobsRequestsState.java#L40-L45

The failed attempts are reset when (1) jobs are activated successfully, and (2) whenever the broker notifies the gateway about newly available jobs.

In the case, the brokers respond with no activated jobs, the Gateway checks the activeRequestsToBeRepeated queue if the request should be repeated immediately or not. That means

  1. as long as an active request x is present in the queue activeRequestsToBeRepeated,
  2. no jobs are activated with request x, and
  3. the failed attempts are lower than the configured failedAttemptThreshold

then the request x will be executed over and over again (see line 163):

https://github.com/camunda-cloud/zeebe/blob/12eeb3f81af0c1b65d2257cab52b71a712562757/gateway/src/main/java/io/camunda/zeebe/gateway/impl/job/LongPollingActivateJobsHandler.java#L157-L167

Scenario 1: Duplicates in activeRequestsToBeRepeated when other requests succeeds

Given three activate requests x, y, and z:

  1. send request z to broker -> current activeRequests = [ z ] and activeRequestsToBeRepeated = [ ]
  2. send request y to broker -> current activeRequests = [ z, y ] and activeRequestsToBeRepeated = [ ]
  3. send request x to broker -> current activeRequests = [ z, y, x ] and activeRequestsToBeRepeated = [ ]
  4. request z completes with at least one activated job -> reset failed attempts -> current activeRequests = [ y, x ] and activeRequestsToBeRepeated = [ y, x ]
  5. request y completes with at least one activated job -> reset failed attempts -> current activeRequests = [ x ] and activeRequestsToBeRepeated = [ x, x ]

Scenario 2: Duplicates in activeRequestsToBeRepeated on multiple notifications

Given is a request x:

  1. send request x to broker -> current activeRequests = [ x ] and activeRequestsToBeRepeated = [ ]
  2. Broker notifies the gateway about new available jobs -> reset failed attempts -> current activeRequests = [ x ] and activeRequestsToBeRepeated = [ x ]
  3. Again, Broker notifies the gateway about new available jobs -> reset failed attempts -> current activeRequests = [ x ] and activeRequestsToBeRepeated = [ x, x ]

Scenario 3: Another request wins

Given are requests [z1, z2, z3, ...., zn] and x:

  1. send requests [z1, z2, z3, ...., zn] and x -> current activeRequests = [ z1, z2, z3, ...., zn, x ] and activeRequestsToBeRepeated = [ ]`
  2. request z1 completes with at least one activated job -> reset failed attempts -> current activeRequests = [ z2, z3, ...., zn, x ] and activeRequestsToBeRepeated = [ z2, z3, ...., zn, x ]
  3. request x completes without any jobs -> retry request -> current activeRequests = [ z2, z3, ...., zn, x ] and activeRequestsToBeRepeated = [ z2, z3, ...., zn ]
  4. request z2 completes with at least one activated job -> reset failed attempts -> current activeRequests = [ z3, ...., zn, x ] and activeRequestsToBeRepeated = [ z3, ...., zn, x ]
  5. request x completes without any jobs -> retry request -> current activeRequests = [ z3, ...., zn, x ] and activeRequestsToBeRepeated = [ z2, z3, ...., zn ]
  6. ...

In all scenarios, the queue activeRequestsToBeRepeated contains the request x at least once before the response arrives. In a nutshell, as long as the failed attempts are lower than the configured threshold and the request x is present in the queue of repeatable requests, the request x is executed in a loop when it responds with any activated job.

Additionally, the property requestTimeout coming along with the activate request might not be considered in the scenarios above. Meaning, the provided requestTimeout is only considered when

  1. the request responds without any activated job, and
  2. the failed attempts are greater than the configured failed attempts threshold.

As long as none of these conditions are fulfilled, the request will be in an execution loop and the client won't receive any response until the client closes the connection/request after a certain timeout.

To my understanding, the current implementation does not behave as documented in the API:

https://github.com/camunda-cloud/zeebe/blob/12eeb3f81af0c1b65d2257cab52b71a712562757/gateway-protocol/src/main/proto/gateway.proto#L25-L28

To Reproduce

Expected behavior

Environment:

  • Zeebe Version: 1.2.5

is related to #8267
is related to #7659
depends on #8389
depends on #8390

@romansmirnov romansmirnov added kind/bug Categorizes an issue or PR as a bug severity/low Marks a bug as having little to no noticeable impact for the user scope/gateway Marks an issue or PR to appear in the gateway section of the changelog labels Dec 2, 2021
@romansmirnov romansmirnov self-assigned this Dec 2, 2021
@romansmirnov
Copy link
Member Author

@deepthidevaki and @npepinpe,

Based on the observations made in #8267, I spent some time to get a better understanding of the long-polling implementation. To my understanding, it does not behave as documented in the "API" which might cause different issues like the request might be executed in a "loop"

https://github.com/camunda-cloud/zeebe/blob/12eeb3f81af0c1b65d2257cab52b71a712562757/gateway-protocol/src/main/proto/gateway.proto#L25-L28

Do I miss something? And if not, should we adjust the implementation according to the documentation or the other way around? Should we consider this for this quarter Q4?

@Zelldon
Copy link
Member

Zelldon commented Dec 2, 2021

I think this is related or superset of #7659

@romansmirnov
Copy link
Member Author

@Zelldon, thanks for linking issue #7659. It's actually quite an interesting scenario, you raised in the issue. However, but I don't think they are related, i.e., in my opinion, fixing this issue will not solve the problem described in #7659.

@romansmirnov romansmirnov moved this from In progress to Review in progress in Zeebe Dec 16, 2021
ghost pushed a commit that referenced this issue Dec 20, 2021
8391: fix(polling): respect request timeout settings r=oleschoenburg a=romansmirnov

## Description

* If long polling is disabled by the received request, then always complete the request immediately even when no jobs are activated.
* Ensure that the provided request timeout is respected so that the request completes at latest at the given timeout.

## Related issues

<!-- Which issues are closed by this PR or are related -->

relates #8310 
closes #8389 



Co-authored-by: Roman <roman.smirnov@camunda.com>
Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
ghost pushed a commit that referenced this issue Dec 20, 2021
8448: [Backport stable/1.2] deps(maven): bump version.elasticsearch from 7.16.1 to 7.16.2 r=menski a=npepinpe

Bumps `version.elasticsearch` from 7.16.1 to 7.16.2.

Updates `elasticsearch-x-content` from 7.16.1 to 7.16.2
- [Release notes](https://github.com/elastic/elasticsearch/releases)
- [Commits](elastic/elasticsearch@v7.16.1...v7.16.2)

Updates `elasticsearch-rest-client` from 7.16.1 to 7.16.2
- [Release notes](https://github.com/elastic/elasticsearch/releases)
- [Commits](elastic/elasticsearch@v7.16.1...v7.16.2)

---
updated-dependencies:
- dependency-name: org.elasticsearch:elasticsearch-x-content
  dependency-type: direct:production
  update-type: version-update:semver-patch
- dependency-name: org.elasticsearch.client:elasticsearch-rest-client
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
(cherry picked from commit 4f41e4f)

8450: [Backport stable/1.2] fix(polling): respect request timeout settings r=oleschoenburg a=github-actions[bot]

# Description
Backport of #8391 to `stable/1.2`.

relates to #8310 #8389

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sebastian Menski <sebastian.menski@camunda.com>
Co-authored-by: Roman <roman.smirnov@camunda.com>
Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
ghost pushed a commit that referenced this issue Dec 20, 2021
8447: [Backport stable/1.1] deps(maven): bump version.elasticsearch from 7.16.1 to 7.16.2 r=menski a=npepinpe

Bumps `version.elasticsearch` from 7.16.1 to 7.16.2.

Updates `elasticsearch-x-content` from 7.16.1 to 7.16.2
- [Release notes](https://github.com/elastic/elasticsearch/releases)
- [Commits](elastic/elasticsearch@v7.16.1...v7.16.2)

Updates `elasticsearch-rest-client` from 7.16.1 to 7.16.2
- [Release notes](https://github.com/elastic/elasticsearch/releases)
- [Commits](elastic/elasticsearch@v7.16.1...v7.16.2)

---
updated-dependencies:
- dependency-name: org.elasticsearch:elasticsearch-x-content
  dependency-type: direct:production
  update-type: version-update:semver-patch
- dependency-name: org.elasticsearch.client:elasticsearch-rest-client
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
(cherry picked from commit 4f41e4f)

8449: [Backport stable/1.1] fix(polling): respect request timeout settings r=oleschoenburg a=github-actions[bot]

# Description
Backport of #8391 to `stable/1.1`.

relates to #8310 #8389

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sebastian Menski <sebastian.menski@camunda.com>
Co-authored-by: Roman <roman.smirnov@camunda.com>
Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
ghost pushed a commit that referenced this issue Dec 20, 2021
8392: fix(polling/state): prevent duplicates in repeatable requests list r=oleschoenburg a=romansmirnov

## Description

<!-- Please explain the changes you made here. -->

## Related issues

<!-- Which issues are closed by this PR or are related -->

relates #8310
closes #8390



Co-authored-by: Roman <roman.smirnov@camunda.com>
ghost pushed a commit that referenced this issue Dec 20, 2021
8451: [Backport release-1.3.0] fix(polling): respect request timeout settings r=oleschoenburg a=github-actions[bot]

# Description
Backport of #8391 to `release-1.3.0`.

relates to #8310 #8389

Co-authored-by: Roman <roman.smirnov@camunda.com>
Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
ghost pushed a commit that referenced this issue Dec 20, 2021
8436: [Backport/stable 1.2] Fix ZeebePartition can be closed when there are ongoing transitions r=deepthidevaki a=deepthidevaki

Backport of #8344 

closes #7981

Due to merge conflicts, the commits that refactored the code are not backported.

8454: [Backport stable/1.2] fix(polling/state): prevent duplicates in repeatable requests list r=oleschoenburg a=github-actions[bot]

# Description
Backport of #8392 to `stable/1.2`.

relates to #8310 #8390

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Co-authored-by: Roman <roman.smirnov@camunda.com>
ghost pushed a commit that referenced this issue Dec 20, 2021
8453: [Backport stable/1.1] fix(polling/state): prevent duplicates in repeatable requests list r=oleschoenburg a=github-actions[bot]

# Description
Backport of #8392 to `stable/1.1`.

relates to #8310 #8390

Co-authored-by: Roman <roman.smirnov@camunda.com>
@deepthidevaki
Copy link
Contributor

Is this already fixed? Two related issues are already closed. Is there anything else to be done here?

@romansmirnov romansmirnov moved this from Review in progress to Done in Zeebe Dec 22, 2021
@korthout korthout added the version:1.3.0 Marks an issue as being completely or in parts released in 1.3.0 label Jan 4, 2022
@KerstinHebel KerstinHebel removed this from Done in Zeebe Mar 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes an issue or PR as a bug scope/gateway Marks an issue or PR to appear in the gateway section of the changelog severity/low Marks a bug as having little to no noticeable impact for the user support Marks an issue as related to a customer support request version:1.3.0 Marks an issue as being completely or in parts released in 1.3.0
Projects
None yet
Development

No branches or pull requests

8 participants