Requests to activate jobs may result in infinite execution from the Gateway to the brokers #8310
Comments
@deepthidevaki and @npepinpe, based on the observations made in #8267, I spent some time getting a better understanding of the long-polling implementation. To my understanding, it does not behave as documented in the "API", which might cause different issues, for example a request being executed in a "loop". Am I missing something? If not, should we adjust the implementation to match the documentation, or the other way around? Should we consider this for this quarter (Q4)?
I think this is related to, or a superset of, #7659.
8391: fix(polling): respect request timeout settings r=oleschoenburg a=romansmirnov
## Description
* If long polling is disabled by the received request, then always complete the request immediately even when no jobs are activated.
* Ensure that the provided request timeout is respected so that the request completes at the latest at the given timeout.
## Related issues
relates #8310
closes #8389
Co-authored-by: Roman <roman.smirnov@camunda.com>
Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
8448: [Backport stable/1.2] deps(maven): bump version.elasticsearch from 7.16.1 to 7.16.2 r=menski a=npepinpe Bumps `version.elasticsearch` from 7.16.1 to 7.16.2. Updates `elasticsearch-x-content` from 7.16.1 to 7.16.2 - [Release notes](https://github.com/elastic/elasticsearch/releases) - [Commits](elastic/elasticsearch@v7.16.1...v7.16.2) Updates `elasticsearch-rest-client` from 7.16.1 to 7.16.2 - [Release notes](https://github.com/elastic/elasticsearch/releases) - [Commits](elastic/elasticsearch@v7.16.1...v7.16.2) --- updated-dependencies: - dependency-name: org.elasticsearch:elasticsearch-x-content dependency-type: direct:production update-type: version-update:semver-patch - dependency-name: org.elasticsearch.client:elasticsearch-rest-client dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> (cherry picked from commit 4f41e4f) 8450: [Backport stable/1.2] fix(polling): respect request timeout settings r=oleschoenburg a=github-actions[bot] # Description Backport of #8391 to `stable/1.2`. relates to #8310 #8389 Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Sebastian Menski <sebastian.menski@camunda.com> Co-authored-by: Roman <roman.smirnov@camunda.com> Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
8447: [Backport stable/1.1] deps(maven): bump version.elasticsearch from 7.16.1 to 7.16.2 r=menski a=npepinpe Bumps `version.elasticsearch` from 7.16.1 to 7.16.2. Updates `elasticsearch-x-content` from 7.16.1 to 7.16.2 - [Release notes](https://github.com/elastic/elasticsearch/releases) - [Commits](elastic/elasticsearch@v7.16.1...v7.16.2) Updates `elasticsearch-rest-client` from 7.16.1 to 7.16.2 - [Release notes](https://github.com/elastic/elasticsearch/releases) - [Commits](elastic/elasticsearch@v7.16.1...v7.16.2) --- updated-dependencies: - dependency-name: org.elasticsearch:elasticsearch-x-content dependency-type: direct:production update-type: version-update:semver-patch - dependency-name: org.elasticsearch.client:elasticsearch-rest-client dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> (cherry picked from commit 4f41e4f) 8449: [Backport stable/1.1] fix(polling): respect request timeout settings r=oleschoenburg a=github-actions[bot] # Description Backport of #8391 to `stable/1.1`. relates to #8310 #8389 Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Sebastian Menski <sebastian.menski@camunda.com> Co-authored-by: Roman <roman.smirnov@camunda.com> Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
8392: fix(polling/state): prevent duplicates in repeatable requests list r=oleschoenburg a=romansmirnov
## Description
## Related issues
relates #8310
closes #8390
Co-authored-by: Roman <roman.smirnov@camunda.com>
8436: [Backport/stable 1.2] Fix ZeebePartition can be closed when there are ongoing transitions r=deepthidevaki a=deepthidevaki Backport of #8344 closes #7981 Due to merge conflicts, the commits that refactored the code are not backported. 8454: [Backport stable/1.2] fix(polling/state): prevent duplicates in repeatable requests list r=oleschoenburg a=github-actions[bot] # Description Backport of #8392 to `stable/1.2`. relates to #8310 #8390 Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com> Co-authored-by: Roman <roman.smirnov@camunda.com>
Is this already fixed? Two related issues are already closed. Is there anything else to be done here?
Describe the bug
Whenever the failed attempts (i.e., `InFlightLongPollingActivateJobsRequestsState#failedAttempts`) are reset, all currently active requests are added to the queue of `activeRequestsToBeRepeated`: https://github.com/camunda-cloud/zeebe/blob/12eeb3f81af0c1b65d2257cab52b71a712562757/gateway/src/main/java/io/camunda/zeebe/gateway/impl/job/InFlightLongPollingActivateJobsRequestsState.java#L40-L45
The failed attempts are reset (1) when jobs are activated successfully, and (2) whenever the broker notifies the gateway about newly available jobs.
When the brokers respond with no activated jobs, the Gateway checks the `activeRequestsToBeRepeated` queue to decide whether the request should be repeated immediately. That means: if `x` is present in the queue `activeRequestsToBeRepeated`, the brokers respond with no activated jobs for `x`, and the failed attempts are lower than `failedAttemptThreshold`, then the request `x` will be executed over and over again (see line 163): https://github.com/camunda-cloud/zeebe/blob/12eeb3f81af0c1b65d2257cab52b71a712562757/gateway/src/main/java/io/camunda/zeebe/gateway/impl/job/LongPollingActivateJobsHandler.java#L157-L167
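To make the mechanism easier to follow, here is a minimal sketch of the state described above. It is a simplified, hypothetical model and not the actual `InFlightLongPollingActivateJobsRequestsState` class: requests are plain strings, and the names `InFlightStateSketch`, `addActiveRequest`, and the two-argument `shouldBeRepeated` are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Simplified, hypothetical model of the in-flight request state (not the Zeebe class). */
class InFlightStateSketch {
  private final List<String> activeRequests = new ArrayList<>();
  private final Queue<String> activeRequestsToBeRepeated = new ArrayDeque<>();
  private int failedAttempts;

  void addActiveRequest(String request) {
    activeRequests.add(request);
  }

  void incrementFailedAttempts() {
    failedAttempts++;
  }

  /** Called when jobs were activated or when a broker notification arrived. */
  void resetFailedAttempts() {
    failedAttempts = 0;
    // Every active request is queued again, even if it is already in the queue.
    activeRequestsToBeRepeated.addAll(activeRequests);
  }

  /** Decides whether a request that completed without jobs is sent again right away. */
  boolean shouldBeRepeated(String request, int failedAttemptThreshold) {
    // remove(Object) drops only the first occurrence, so a duplicate survives.
    return failedAttempts < failedAttemptThreshold
        && activeRequestsToBeRepeated.remove(request);
  }
}
```

In this model, two resets while `x` is active leave `x` in the queue twice, so two consecutive empty responses both lead to an immediate repeat.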
Scenario 1: Duplicates in `activeRequestsToBeRepeated` when other requests succeed
Given three activate requests `x`, `y`, and `z`:
1. `z` to broker -> current `activeRequests = [ z ]` and `activeRequestsToBeRepeated = [ ]`
2. `y` to broker -> current `activeRequests = [ z, y ]` and `activeRequestsToBeRepeated = [ ]`
3. `x` to broker -> current `activeRequests = [ z, y, x ]` and `activeRequestsToBeRepeated = [ ]`
4. `z` completes with at least one activated job -> reset failed attempts -> current `activeRequests = [ y, x ]` and `activeRequestsToBeRepeated = [ y, x ]`
5. `y` completes with at least one activated job -> reset failed attempts -> current `activeRequests = [ x ]` and `activeRequestsToBeRepeated = [ x, x ]`
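Scenario 2: Duplicates in `activeRequestsToBeRepeated` on multiple notifications
Given is a request `x`:
1. `x` to broker -> current `activeRequests = [ x ]` and `activeRequestsToBeRepeated = [ ]`
2. broker notifies the gateway about available jobs -> reset failed attempts -> current `activeRequests = [ x ]` and `activeRequestsToBeRepeated = [ x ]`
3. broker notifies the gateway about available jobs -> reset failed attempts -> current `activeRequests = [ x ]` and `activeRequestsToBeRepeated = [ x, x ]`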
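One possible way to rule out the duplicates from Scenarios 1 and 2 is to back the repeat queue with an insertion-ordered set. The following is only a sketch under that assumption; the class `DeduplicatedRepeatQueueSketch` and its methods are hypothetical, and I don't claim this is how the actual fix is implemented.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical repeat queue that keeps insertion order but never stores a request twice. */
class DeduplicatedRepeatQueueSketch {
  private final Set<String> toBeRepeated = new LinkedHashSet<>();

  /** Called on every reset of the failed attempts; already queued requests are ignored. */
  void enqueueAll(List<String> activeRequests) {
    toBeRepeated.addAll(activeRequests);
  }

  /** Returns true (and dequeues the request) only if it was marked for repetition. */
  boolean pollIfPresent(String request) {
    return toBeRepeated.remove(request);
  }

  int size() {
    return toBeRepeated.size();
  }
}
```

With this structure, replaying Scenario 1 (`enqueueAll([y, x])` followed by `enqueueAll([x])`) leaves `x` queued exactly once, so it can be repeated at most once per reset.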
Scenario 3: Another request wins
Given are requests `[z1, z2, z3, ...., zn]` and `x`:
1. `[z1, z2, z3, ...., zn]` and `x` to broker -> current `activeRequests = [ z1, z2, z3, ...., zn, x ]` and `activeRequestsToBeRepeated = [ ]`
2. `z1` completes with at least one activated job -> reset failed attempts -> current `activeRequests = [ z2, z3, ...., zn, x ]` and `activeRequestsToBeRepeated = [ z2, z3, ...., zn, x ]`
3. `x` completes without any jobs -> retry request -> current `activeRequests = [ z2, z3, ...., zn, x ]` and `activeRequestsToBeRepeated = [ z2, z3, ...., zn ]`
4. `z2` completes with at least one activated job -> reset failed attempts -> current `activeRequests = [ z3, ...., zn, x ]` and `activeRequestsToBeRepeated = [ z3, ...., zn, x ]`
5. `x` completes without any jobs -> retry request -> current `activeRequests = [ z3, ...., zn, x ]` and `activeRequestsToBeRepeated = [ z3, ...., zn ]`
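Scenario 3 can be replayed with a small simulation. This is a hypothetical sketch, not gateway code: it assumes that a completed request is dropped from both collections, that every successful completion re-queues all active requests, and that `x` always comes back without jobs right after each `zi` succeeds.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical replay of Scenario 3: x loses against z1..zn every time. */
class ScenarioThreeSketch {
  /** Returns how often x is re-sent while n other requests each activate a job. */
  static int countRetriesOfX(int n) {
    List<String> activeRequests = new ArrayList<>();
    Queue<String> toBeRepeated = new ArrayDeque<>();
    for (int i = 1; i <= n; i++) {
      activeRequests.add("z" + i);
    }
    activeRequests.add("x");

    int retries = 0;
    for (int i = 1; i <= n; i++) {
      String done = "z" + i;
      // zi completes with at least one job: it leaves both collections ...
      activeRequests.remove(done);
      toBeRepeated.removeIf(done::equals);
      // ... and the reset of the failed attempts queues all remaining active requests.
      toBeRepeated.addAll(activeRequests);
      // x completes without jobs and is retried because it is queued.
      if (toBeRepeated.remove("x")) {
        retries++;
      }
    }
    return retries;
  }
}
```

Under these assumptions `countRetriesOfX(n)` grows linearly with `n`: `x` is retried once per winning request and is never completed towards the client in between.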
In all scenarios, the queue `activeRequestsToBeRepeated` contains the request `x` at least once before the response arrives. In a nutshell, as long as the failed attempts are lower than the configured threshold and the request `x` is present in the queue of repeatable requests, the request `x` is executed in a loop whenever it responds without any activated job.
Additionally, the property `requestTimeout` coming along with the activate request might not be considered in the scenarios above. Meaning, the provided `requestTimeout` is only considered under certain conditions. As long as none of these conditions are fulfilled, the request will be in an execution loop and the client won't receive any response until the client closes the connection/request after a certain timeout.
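A guard like the following could bound the loop by the request's own timeout. It is a minimal sketch, assuming the semantics I understand the API documentation to describe: a negative `requestTimeout` disables long polling, and `0` falls back to a default timeout. The class name, the method `mayRepeat`, and the default value are hypothetical.

```java
/** Hypothetical guard that stops repeating a request once its timeout has elapsed. */
class RequestTimeoutSketch {
  // Assumed default for illustration only; the real default is a gateway setting.
  static final long DEFAULT_TIMEOUT_MILLIS = 10_000;

  /**
   * Returns true if a request that came back without jobs may still be repeated.
   *
   * @param requestTimeoutMillis timeout sent with the activate request
   * @param receivedAtMillis time at which the gateway received the request
   * @param nowMillis current time
   */
  static boolean mayRepeat(long requestTimeoutMillis, long receivedAtMillis, long nowMillis) {
    if (requestTimeoutMillis < 0) {
      // Long polling disabled: complete immediately, even without activated jobs.
      return false;
    }
    long effectiveTimeout =
        requestTimeoutMillis == 0 ? DEFAULT_TIMEOUT_MILLIS : requestTimeoutMillis;
    return nowMillis - receivedAtMillis < effectiveTimeout;
  }
}
```

Checking this before every immediate repeat would mean the client receives a response no later than the timeout it asked for, regardless of how often the request is re-queued.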
To my understanding, the current implementation does not behave as documented in the API:
https://github.com/camunda-cloud/zeebe/blob/12eeb3f81af0c1b65d2257cab52b71a712562757/gateway-protocol/src/main/proto/gateway.proto#L25-L28
To Reproduce
Expected behavior
`activeRequestsToBeRepeated` should not contain duplicates, see "InFlightLongPollingActivateJobsRequestsState#activeRequestsToBeRepeated contains duplicates" (#8390)
Environment:
is related to #8267
is related to #7659
depends on #8389
depends on #8390