Make activated jobs which were not sent to clients re-activatable #3631

Closed
deepthidevaki opened this issue Jan 10, 2020 · 10 comments · Fixed by #8879
Labels

  • area/performance - Marks an issue as performance related
  • blocker/stakeholder - Marks an issue as blocked, waiting for stakeholder input
  • kind/bug - Categorizes an issue or PR as a bug
  • scope/broker - Marks an issue or PR to appear in the broker section of the changelog
  • scope/gateway - Marks an issue or PR to appear in the gateway section of the changelog
  • support - Marks an issue as related to a customer support request
  • version:1.2.12 - Marks an issue as being completely or in parts released in 1.2.12
  • version:1.3.6 - Marks an issue as being completely or in parts released in 1.3.6

Comments

@deepthidevaki
Contributor

deepthidevaki commented Jan 10, 2020

Description

Sometimes the client connection gets closed (long polling timed out, network interruption, client died, etc.) before the gateway can send the activate-jobs response back to the client. As a result, the job is marked as activated in the broker but is never picked up by a client until the specified job timeout expires. This might be acceptable in some cases, but if the job timeout is high, clients might observe a huge latency between the time a job is created and the time it is completed.

This can happen at any time, but it is more frequent with long polling (see #3585).

Proposal

When the gateway realizes that the response cannot be sent to the client, it can send a request to the broker to cancel the activation.
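
The proposal can be sketched roughly as follows. This is a minimal, self-contained Java illustration, not Zeebe's actual gateway or broker code: `Broker`, `Gateway`, `ClientConnection`, `Job`, and `cancelActivation` are hypothetical names invented for this sketch, and the in-memory job list stands in for the real command flow between gateway and broker.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these types are part of Zeebe.
final class ReactivationSketch {

  // Stand-in for the broker's view of a job.
  static final class Job {
    enum State { ACTIVATABLE, ACTIVATED }

    final long key;
    State state = State.ACTIVATABLE;

    Job(long key) {
      this.key = key;
    }
  }

  // Stand-in broker with only the two operations the sketch needs.
  static final class Broker {
    final List<Job> jobs = new ArrayList<>();

    // Marks every activatable job as activated and returns those jobs.
    List<Job> activateJobs() {
      List<Job> activated = new ArrayList<>();
      for (Job job : jobs) {
        if (job.state == Job.State.ACTIVATABLE) {
          job.state = Job.State.ACTIVATED;
          activated.add(job);
        }
      }
      return activated;
    }

    // The proposed "cancel activation" request: make the job activatable again.
    void cancelActivation(Job job) {
      if (job.state == Job.State.ACTIVATED) {
        job.state = Job.State.ACTIVATABLE;
      }
    }
  }

  // Stand-in client connection; send() reports whether delivery succeeded.
  interface ClientConnection {
    boolean send(List<Job> jobs);
  }

  static final class Gateway {
    private final Broker broker;

    Gateway(Broker broker) {
      this.broker = broker;
    }

    // Activates jobs and hands them back to the broker if the client is gone.
    boolean handleActivateJobs(ClientConnection connection) {
      List<Job> activated = broker.activateJobs();
      boolean delivered = connection.send(activated);
      if (!delivered) {
        for (Job job : activated) {
          broker.cancelActivation(job);
        }
      }
      return delivered;
    }
  }

  public static void main(String[] args) {
    Broker broker = new Broker();
    broker.jobs.add(new Job(1L));
    Gateway gateway = new Gateway(broker);

    // The first client has already disconnected, so delivery fails.
    gateway.handleActivateJobs(jobs -> false);
    System.out.println(broker.jobs.get(0).state);

    // A second client can pick the job up right away.
    gateway.handleActivateJobs(jobs -> true);
    System.out.println(broker.jobs.get(0).state);
  }
}
```

In this sketch, the job whose response could not be delivered returns to the activatable state immediately instead of waiting for the job timeout, so the second request can activate it.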


related to SUPPORT-13198

@deepthidevaki deepthidevaki added kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. scope/broker Marks an issue or PR to appear in the broker section of the changelog labels Jan 10, 2020
@Zelldon Zelldon added Status: Needs Priority kind/feature Categorizes an issue or PR as a feature, i.e. new behavior area/performance Marks an issue as performance related and removed kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. labels May 12, 2020
@antoniodfr

Hi, we are having errors due to this bug with Zeebe 0.26.1 in our production environment. Do you know when a fix will be planned?

Thank you in advance.

@deepthidevaki
Contributor Author

This is currently in the backlog. However, to prioritize it, could you give us more context?

  • How did you verify that it is really this issue?
  • Is an activated job stuck only occasionally, or does it happen frequently?
  • How much impact does this have on you?

@antoniodfr

We identified it in this forum thread

It happens occasionally, but our business logic doesn't allow retries, so we can't decrease the jobTimeout. When the issue happens, we lose a customer in our e-commerce due to the timeout.

@korthout
Member

@npepinpe I want to bring some additional attention to this issue. I regularly see community members run into some form of #5387, for which this can be a solution.

Having discussed it with @saig0, we think the idea is good, but an explicit JobIntent would be necessary to make it clear that this is not a user sending a FailJob.

@korthout
Member

I have given some thought to potential JobIntents:

  • defer - seems like it would not be activatable for some time once it's deferred
  • recover - seems like it would also affect the number of retries or be able to recover from an error
  • reject - seems like it would mean that this client would no longer want to be able to activate this job
  • relinquish - seems like it would mean that this client would no longer want to be able to activate this job
  • reset - seems like it would also affect the number of retries
  • return - seems unclear whether it was completed, or not
  • shelve - seems like it would not be activatable once it is shelved
  • suspend - seems like it would not be activatable once it is suspended
  • yield - IMO feels the most like the job is given back to the engine for re-activation, although it is similar to relinquish and reject in that it might have an expectation that this client won't want to be able to activate the job again, but IMO this feeling is less strong than with relinquish and reject
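
To illustrate why a dedicated intent is preferable to reusing a user-facing failure, the following sketch contrasts the two. It is an assumption-laden toy model, not Zeebe's engine: the `Intent`, `State`, and `Job` types and the `apply` method are hypothetical, `YIELD` is just one of the candidate names from the list above, and decrementing retries on `FAIL` is a simplification of how failures are handled.

```java
// Hypothetical sketch; none of these types are part of Zeebe.
final class JobIntentSketch {

  enum Intent { FAIL, YIELD }

  enum State { ACTIVATABLE, ACTIVATED, FAILED }

  static final class Job {
    State state = State.ACTIVATED;
    int retries;

    Job(int retries) {
      this.retries = retries;
    }
  }

  // Applies an intent to an activated job and updates its state in place.
  static void apply(Job job, Intent intent) {
    if (job.state != State.ACTIVATED) {
      throw new IllegalStateException("job is not activated");
    }
    switch (intent) {
      case FAIL:
        // Simplification: a reported failure consumes one retry.
        job.retries--;
        job.state = job.retries > 0 ? State.ACTIVATABLE : State.FAILED;
        break;
      case YIELD:
        // The job is handed back for re-activation; retries stay untouched.
        job.state = State.ACTIVATABLE;
        break;
    }
  }

  public static void main(String[] args) {
    Job failed = new Job(1);
    apply(failed, Intent.FAIL);
    System.out.println(failed.state + " retries=" + failed.retries);

    Job yielded = new Job(1);
    apply(yielded, Intent.YIELD);
    System.out.println(yielded.state + " retries=" + yielded.retries);
  }
}
```

For a job with a single retry, as in the no-retries use case mentioned earlier in this thread, the failure path in this model leaves the job failed, while the yield path makes it activatable again with its retries intact.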

@npepinpe
Member

What is the expected impact of fixing this, in concrete terms? Also, I'm not sure how this relates to the call closed. I imagine it is for some calls, but do we know anything more about that?

@korthout
Member

korthout commented Jan 14, 2022

I believe this is what users experience when they say that jobs sometimes get lost while delivering them to the client. I believe this would resolve that problem.

For example, this already happens when a client sends an activate-jobs request, the gateway has long polling enabled, and the client's request times out before a new job is activated in the broker. Once a job becomes available and is activated, the client no longer has a connection to the gateway, so the gateway can't deliver the job.

@npepinpe
Member

Fair point. I'll bring it up again during planning, but it looks like our engine team is pretty busy this quarter.

@npepinpe npepinpe added the blocker/stakeholder Marks an issue as blocked, waiting for stakeholder input label Jan 14, 2022
@jerry123888

Hi @npepinpe,
At present this problem has a great impact on us, and we have received many user complaints.
Do you have any plans to start working on it? Could you tell us the expected start date? We look forward to a fix.

@Young200808

Young200808 commented Feb 16, 2022

@npepinpe
Is there any plan to fix this issue for the Java client? It has a big impact on our production environment.

@romansmirnov romansmirnov added support Marks an issue as related to a customer support request scope/gateway Marks an issue or PR to appear in the gateway section of the changelog kind/bug Categorizes an issue or PR as a bug and removed kind/feature Categorizes an issue or PR as a feature, i.e. new behavior labels Mar 15, 2022
@romansmirnov romansmirnov self-assigned this Mar 17, 2022
@ghost ghost closed this as completed in 92df866 Mar 22, 2022
ghost pushed a commit that referenced this issue Mar 23, 2022
8958: [Backport stable/1.2] #3631: Re-activate jobs r=romansmirnov a=github-actions[bot]

# Description
Backport of #8879 to `stable/1.2`.

relates to #3631

Co-authored-by: Roman <roman.smirnov@camunda.com>
ghost pushed a commit that referenced this issue Mar 23, 2022
8959: [Backport stable/1.3] #3631: Re-activate jobs r=romansmirnov a=github-actions[bot]

# Description
Backport of #8879 to `stable/1.3`.

relates to #3631

Co-authored-by: Roman <roman.smirnov@camunda.com>
@npepinpe npepinpe added version:1.3.6 Marks an issue as being completely or in parts released in 1.3.6 Release: 8.0.0-rc1 labels Mar 25, 2022
@npepinpe npepinpe added the version:1.2.12 Marks an issue as being completely or in parts released in 1.2.12 label Apr 5, 2022
This issue was closed.
9 participants