I can't activate jobs with a high max job count #5525

Closed
Zelldon opened this issue Oct 7, 2020 · 16 comments · Fixed by #8799
Labels: kind/bug, scope/broker, severity/high, version:1.3.5

Comments

Zelldon (Member) commented Oct 7, 2020

Describe the bug

Reported by a user; I investigated an issue where it seems they are not able to activate more jobs. They changed the default configuration of maxMessageSize to 128KB and experienced errors in the broker indicating that the job batch record is too large.

To Reproduce

  • Run a cluster with maxMessageSize set to 128KB
  • Run our starters to create a backlog of jobs
  • Start a worker locally with maxActivationCount set to 1000+
  • FetchVariables needs to be set to ""; otherwise all variables are fetched in our benchmarks, and in that case the splitting of the job activation seems to work

If we set FetchVariables to empty, the broker seems to try to put more jobs into the job activation batch and fails while doing so.

```csharp
client.NewWorker()
      .JobType(JobType)
      .Handler(HandleJob)
      .MaxJobsActive(5000)
      .Name(WorkerName)
      .AutoCompletion()
      .PollInterval(TimeSpan.FromSeconds(1))
      .Timeout(TimeSpan.FromSeconds(10))
      .FetchVariables("")
      .Open();
```

Expected behavior
I can activate jobs without issues. If the max message size is reached, the broker still returns jobs up to that limit.

Log/Stacktrace
https://console.cloud.google.com/logs/viewer?interval=NO_LIMIT&authuser=1&project=zeebe-io&minLogLevel=0&expandAll=false&timestamp=2020-10-07T11:40:15.009000000Z&customFacets=&limitCustomFacetWidth=true&advancedFilter=resource.type%3D%22k8s_container%22%0Aresource.labels.project_id%3D%22zeebe-io%22%0Aresource.labels.location%3D%22europe-west1-b%22%0Aresource.labels.cluster_name%3D%22zeebe-cluster%22%0Aresource.labels.namespace_name%3D%22zell-small-msg-size%22%0Alabels.k8s-pod%2Fapp_kubernetes_io%2Fcomponent%3D%22broker%22%0Alabels.k8s-pod%2Fapp_kubernetes_io%2Finstance%3D%22zell-small-msg-size%22%0Alabels.k8s-pod%2Fapp_kubernetes_io%2Fmanaged-by%3D%22Helm%22%0Alabels.k8s-pod%2Fapp_kubernetes_io%2Fname%3D%22zeebe-cluster%22&scrollTimestamp=2020-10-07T11:39:38.166836000Z&pinnedLogId=hrguh6fg7uydd&pinnedLogTimestamp=2020-10-07T11:39:38.166836Z

Full Stacktrace

Expected to process event 'TypedEventImpl{metadata=RecordMetadata{recordType=COMMAND, intentValue=255, intent=ACTIVATE, requestStreamId=1, requestId=533714, protocolVersion=2, valueType=JOB_BATCH, rejectionType=NULL_VAL, rejectionReason=, brokerVersion=0.25.0}, value={"type":"benchmark-task","worker":"zell-T490","timeout":10000,"maxJobsToActivate":5000,"jobKeys":[2251799820188250,2251799820188259,2251799820188260,2251799820188282,2251799820188295,2251799820188296,2251799820188393,2251799820188400,2251799820188401,2251799820188429,2251799820188526,2251799820188557,2251799820188600,2251799820188601,2251799820188811,2251799820188851,2251799820188853,2251799820188854,2251799820188855,2251799820188857,2251799820188858,2251799820188864,2251799820188866,2251799820188875,2251799820188895,2251799820188896,2251799820188897,2251799820188898,2251799820188916,2251799820188942,2251799820189019,2251799820189048,2251799820189062,2251799820189067,2251799820189068,2251799820189082,2251799820189083,2251799820189090,2251799820189094,2251799820189097,2251799820189098,2251799820189099,2251799820189163,2251799820189166,2251799820189189,2251799820189206,2251799820189229,2251799820189230,2251799820189232,2251799820189259,2251799820189267,2251799820189309,2251799820189343,2251799820189393,22517998...}' without errors, but exception occurred with message 'Expected to claim segment of size 152616, but can't claim more than 131072 bytes.' ." 

Environment:

  • OS: k8s
  • Zeebe Version: 0.24.2, 0.25.0-SNAPSHOT
  • Configuration: default + maxMessageSize = 128KB
Zelldon added the kind/bug, scope/broker, Impact: Availability, and severity/high labels on Oct 7, 2020
npepinpe (Member) commented Oct 8, 2020

Let's make sure we properly handle this by not returning more jobs than we can fit into the batch.

Zelldon (Member Author) commented Oct 26, 2020

Found a related old issue: #1578

korthout (Member) commented Nov 6, 2020

Linked error on Camunda Cloud.

npepinpe (Member) commented Nov 9, 2020

I'm downgrading this: while I still think it is necessary for GA, it's not needed immediately.

Zelldon (Member Author) commented Dec 10, 2020

This might be fixed by #5991, or at least the chance that it happens is reduced.

falko (Member) commented Jan 25, 2021

My current customer emphasized the importance of this issue again.

Zelldon (Member Author) commented Jan 26, 2021

@falko which version are they using?

falko (Member) commented Feb 22, 2021

The last time my customer reproduced it was on 0.25.3. Now they are using 0.26 but are trying to avoid the issue with workarounds.

falko (Member) commented Feb 22, 2021

Are you planning to include this in 1.0.0?

npepinpe (Member) commented

It doesn't look like it at the moment - feel free to bring it up in the stakeholder meetings if you think it should have higher priority.

npepinpe added this to Ready in Zeebe on Mar 24, 2021
npepinpe moved this from Ready to Planned in Zeebe on Apr 23, 2021
npepinpe moved this from In progress to Review in progress in Zeebe on Feb 14, 2022
ghost pushed a commit that referenced this issue Feb 17, 2022
8798: Add API to probe the logstream batch writer if more bytes can be written without writing them r=npepinpe a=npepinpe

## Description

This PR adds a new API method, `LogStreamBatchWriter#canWriteAdditionalEvent(int)`. This allows users of the writer to probe whether adding the given number of bytes to the batch would cause it to become unwritable, without actually writing anything to the batch or even modifying their DTO (e.g. the `TypedRecord<?>` in the engine).

To avoid having dispatcher details leak into the implementation, an analogous method is added to the dispatcher, `Dispatcher#canClaimFragmentBatch(int, int)`, which will compare the given size, framed and aligned, with the max fragment length. This is the main building block to eventually solve #5525, and enable other use cases (e.g. multi-instance creation) which deal with large batches until we have a more permanent solution (e.g. chunking follow up batches).
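
As a rough illustration of how such a probe could be used, here is a hedged sketch only; it is not the actual `JobBatchCollector` implementation, and the `BatchWriterProbe` interface, the generic job type, and the size estimation function are assumptions introduced for the example:

```java
import java.util.List;
import java.util.function.ToIntFunction;

// Hypothetical stand-in for the probing part of LogStreamBatchWriter#canWriteAdditionalEvent(int).
interface BatchWriterProbe {
  boolean canWriteAdditionalEvent(int sizeInBytes);
}

final class JobBatchTruncationSketch {

  // Collects jobs into `out` only while the next serialized job still fits into the
  // batch, so the activation is truncated instead of failing once the limit is hit.
  static <J> int collectUpTo(
      final Iterable<J> jobs,
      final int maxJobsToActivate,
      final BatchWriterProbe writer,
      final ToIntFunction<J> estimatedSizeInBytes,
      final List<J> out) {
    for (final J job : jobs) {
      if (out.size() >= maxJobsToActivate
          || !writer.canWriteAdditionalEvent(estimatedSizeInBytes.applyAsInt(job))) {
        break; // truncate: return what fits rather than rejecting the whole batch
      }
      out.add(job);
    }
    return out.size();
  }
}
```

The actual change (see #8491 / #8799) works against the real writer and record types rather than these placeholder interfaces.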

NOTE: the tests added in the dispatcher are not very good, but I couldn't come up with something else that wouldn't be too coupled to the implementation (i.e. essentially reusing `LogBufferAppender`). I would like some ideas/suggestions.

NOTE: this PR comes out of the larger one, #8491. You can check that one out to see how the new API would be used, e.g. in the `JobBatchCollector`. As such, this is marked for backporting, since we'll backport the complete fix for #5525.

## Related issues

related to #5525 



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
ghost pushed a commit that referenced this issue Feb 17, 2022
8809: [Backport stable/1.2] Add API to probe the logstream batch writer if more bytes can be written without writing them r=npepinpe a=github-actions[bot]

# Description
Backport of #8798 to `stable/1.2`.

relates to #5525

Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
ghost pushed a commit that referenced this issue Feb 17, 2022
8810: [Backport stable/1.3] Add API to probe the logstream batch writer if more bytes can be written without writing them r=npepinpe a=github-actions[bot]

# Description
Backport of #8798 to `stable/1.3`.

relates to #5525

Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
ghost pushed a commit that referenced this issue Feb 17, 2022
8797: Extend Either/EitherAssert capabilities r=npepinpe a=npepinpe

## Description

This PR extends `Either` by adding a new API, `Either#getOrElse(R)`. This allows extracting the right value of the `Either` or returning a fallback. I did not add any tests as the implementation is incredibly simple, and I can't foresee it ever getting more complex, but do challenge this.
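
For illustration, usage would look roughly like this (a hedged sketch assuming `Either.left`/`Either.right` factory methods as in the existing utility; the values are made up):

```java
final Either<String, Integer> parsed = Either.right(42);
final Either<String, Integer> failed = Either.left("not a number");

assert parsed.getOrElse(0) == 42; // right value is returned as-is
assert failed.getOrElse(0) == 0;  // left falls back to the provided default
```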

It also extends the related `EitherAssert` by adding new `left` and `right` extraction capabilities. So you can now assert something like:

```java
EitherAssert.assertThat(either).left().isEqualTo(1);
EitherAssert.assertThat(instantEither)
	.right()
	.asInstanceOf(InstanceOfAssertFactories.INSTANT)
	.isBetween(today, tomorrow);
```

Note that calling `EitherAssert#right()` will, under the hood, still call `EitherAssert#isRight()`.

This PR is related to #5525 and is extracted from the bigger spike in #8491. You can review how it's used there, specifically in the `JobBatchCollectorTest`. As such, this is marked for backporting, since we'll backport the complete fix for #5525.

## Related issues

related to #5525 



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
Co-authored-by: Nicolas Pepin-Perreault <43373+npepinpe@users.noreply.github.com>
ghost pushed a commit that referenced this issue Feb 21, 2022
8815: [Backport stable/1.2] Extend Either/EitherAssert capabilities r=npepinpe a=npepinpe

## Description

This PR backports #8797 to stable/1.2, which is necessary in order to eventually backport the fix for #5525. There were some conflicts, which I corrected with the last commit (one extra dependency pulled in).

## Related issues

backports #8797 



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
Co-authored-by: Nicolas Pepin-Perreault <43373+npepinpe@users.noreply.github.com>
ghost pushed a commit that referenced this issue Feb 21, 2022
8814: [Backport stable/1.3] Extend Either/EitherAssert capabilities r=npepinpe a=npepinpe

## Description

This PR backports #8797 to stable/1.3, which is necessary in order to eventually backport the fix for #5525. There were some conflicts, which I corrected with the last commit (one extra dependency pulled in).

## Related issues

backports #8797 



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
Co-authored-by: Nicolas Pepin-Perreault <43373+npepinpe@users.noreply.github.com>
ghost closed this as completed in bd35278 on Feb 23, 2022
Zeebe automation moved this from Review in progress to Done Feb 23, 2022
ghost pushed a commit that referenced this issue Feb 23, 2022
8832: [Backport stable/1.3] Correctly truncate a job activation batch if it will not fit in the dispatcher r=npepinpe a=github-actions[bot]

# Description
Backport of #8799 to `stable/1.3`.

relates to #8797 #5525

Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
ghost pushed a commit that referenced this issue Feb 23, 2022
8831: [Backport stable/1.2] Correctly truncate a job activation batch if it will not fit in the dispatcher r=npepinpe a=github-actions[bot]

# Description
Backport of #8799 to `stable/1.2`.

relates to #8797 #5525

Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
Zelldon added the Release: 1.2.11 and version:1.3.5 labels on Mar 1, 2022
KerstinHebel removed this from Done in Zeebe on Mar 23, 2022
This issue was closed.