I can't activate jobs with a high max job count #5525

Closed
Zelldon opened this issue Oct 7, 2020 · 16 comments · Fixed by #8799
Labels: kind/bug, scope/broker, severity/high, version:1.3.5

Comments

Zelldon (Member) commented Oct 7, 2020

Describe the bug

Reported by a user; I investigated an issue where it seems they are not able to activate more jobs. They changed the default configuration of maxMessageSize to 128KB and experienced errors in the broker indicating that the job batch record is too large.

To Reproduce

  • Run a cluster with maxMessageSize set to 128KB
  • Run our starters to create a backlog of jobs
  • Start a worker locally with maxActivationCount set to 1000+
  • FetchVariables needs to be set to ""; otherwise all variables are fetched in our benchmarks, and in that case the splitting of the job activation seems to work

If we set FetchVariables to empty, the broker seems to try to put more jobs into the job activation batch and fails while doing so.

```csharp
client.NewWorker()
      .JobType(JobType)
      .Handler(HandleJob)
      .MaxJobsActive(5000)
      .Name(WorkerName)
      .AutoCompletion()
      .PollInterval(TimeSpan.FromSeconds(1))
      .Timeout(TimeSpan.FromSeconds(10))
      .FetchVariables("")
      .Open();
```

Expected behavior
I can activate jobs without issues. If the max message size is reached, the broker still returns jobs up to that limit.

Log/Stacktrace
https://console.cloud.google.com/logs/viewer?interval=NO_LIMIT&authuser=1&project=zeebe-io&minLogLevel=0&expandAll=false&timestamp=2020-10-07T11:40:15.009000000Z&customFacets=&limitCustomFacetWidth=true&advancedFilter=resource.type%3D%22k8s_container%22%0Aresource.labels.project_id%3D%22zeebe-io%22%0Aresource.labels.location%3D%22europe-west1-b%22%0Aresource.labels.cluster_name%3D%22zeebe-cluster%22%0Aresource.labels.namespace_name%3D%22zell-small-msg-size%22%0Alabels.k8s-pod%2Fapp_kubernetes_io%2Fcomponent%3D%22broker%22%0Alabels.k8s-pod%2Fapp_kubernetes_io%2Finstance%3D%22zell-small-msg-size%22%0Alabels.k8s-pod%2Fapp_kubernetes_io%2Fmanaged-by%3D%22Helm%22%0Alabels.k8s-pod%2Fapp_kubernetes_io%2Fname%3D%22zeebe-cluster%22&scrollTimestamp=2020-10-07T11:39:38.166836000Z&pinnedLogId=hrguh6fg7uydd&pinnedLogTimestamp=2020-10-07T11:39:38.166836Z

Full Stacktrace

Expected to process event 'TypedEventImpl{metadata=RecordMetadata{recordType=COMMAND, intentValue=255, intent=ACTIVATE, requestStreamId=1, requestId=533714, protocolVersion=2, valueType=JOB_BATCH, rejectionType=NULL_VAL, rejectionReason=, brokerVersion=0.25.0}, value={"type":"benchmark-task","worker":"zell-T490","timeout":10000,"maxJobsToActivate":5000,"jobKeys":[2251799820188250,2251799820188259,2251799820188260,2251799820188282,2251799820188295,2251799820188296,2251799820188393,2251799820188400,2251799820188401,2251799820188429,2251799820188526,2251799820188557,2251799820188600,2251799820188601,2251799820188811,2251799820188851,2251799820188853,2251799820188854,2251799820188855,2251799820188857,2251799820188858,2251799820188864,2251799820188866,2251799820188875,2251799820188895,2251799820188896,2251799820188897,2251799820188898,2251799820188916,2251799820188942,2251799820189019,2251799820189048,2251799820189062,2251799820189067,2251799820189068,2251799820189082,2251799820189083,2251799820189090,2251799820189094,2251799820189097,2251799820189098,2251799820189099,2251799820189163,2251799820189166,2251799820189189,2251799820189206,2251799820189229,2251799820189230,2251799820189232,2251799820189259,2251799820189267,2251799820189309,2251799820189343,2251799820189393,22517998...}' without errors, but exception occurred with message 'Expected to claim segment of size 152616, but can't claim more than 131072 bytes.' ." 

Environment:

  • OS: k8s
  • Zeebe Version: 0.24.2, 0.25.0-SNAPSHOT
  • Configuration: default + maxMessageSize = 128KB
Zelldon added the kind/bug, scope/broker, Impact: Availability, and severity/high labels on Oct 7, 2020
npepinpe (Member) commented Oct 8, 2020

Let's make sure we properly handle this by not returning more jobs than we can fit into the batch.

Zelldon (Member Author) commented Oct 26, 2020

Found a related old issue: #1578

korthout (Member) commented Nov 6, 2020

Linked error on Camunda Cloud.

npepinpe (Member) commented Nov 9, 2020

I'm downgrading this: while I still think it is necessary for GA, it's not needed immediately.

Zelldon (Member Author) commented Dec 10, 2020

This might be fixed by #5991, or at least the chance that it happens is reduced.

falko (Member) commented Jan 25, 2021

My current customer emphasized the importance of this issue again.

Zelldon (Member Author) commented Jan 26, 2021

@falko which version are they using?

falko (Member) commented Feb 22, 2021

The last time my customer reproduced it was on 0.25.3. Now they are using 0.26 but are trying to avoid the issue with workarounds.

falko (Member) commented Feb 22, 2021

Are you planning to include this in 1.0.0?

npepinpe (Member) commented

It doesn't look like it at the moment - feel free to bring it up in the stakeholder meetings if you think it should have higher priority.

npepinpe added this to Ready in Zeebe on Mar 24, 2021
npepinpe moved this from Ready to Planned in Zeebe on Apr 23, 2021
npepinpe moved this from In progress to Review in progress in Zeebe on Feb 14, 2022
ghost pushed a commit that referenced this issue Feb 17, 2022
8798: Add API to probe the logstream batch writer if more bytes can be written without writing them r=npepinpe a=npepinpe

## Description

This PR adds a new API method, `LogStreamBatchWriter#canWriteAdditionalEvent(int)`. This allows users of the writer to probe whether adding the given number of bytes to the batch would cause it to become unwritable, without actually writing anything to the batch or even modifying their DTO (e.g. the `TypedRecord<?>` in the engine).

To avoid having dispatcher details leak into the implementation, an analogous method is added to the dispatcher, `Dispatcher#canClaimFragmentBatch(int, int)`, which will compare the given size, framed and aligned, with the max fragment length. This is the main building block to eventually solve #5525, and enable other use cases (e.g. multi-instance creation) which deal with large batches until we have a more permanent solution (e.g. chunking follow up batches).
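
As a rough illustration of how such a probe could be used, here is a hedged sketch only; it is not the actual `JobBatchCollector` implementation, and the `BatchWriterProbe` interface, the generic job type, and the size estimation function are assumptions introduced for the example:

```java
import java.util.List;
import java.util.function.ToIntFunction;

// Hypothetical stand-in for the probing part of LogStreamBatchWriter#canWriteAdditionalEvent(int).
interface BatchWriterProbe {
  boolean canWriteAdditionalEvent(int sizeInBytes);
}

final class JobBatchTruncationSketch {

  // Collects jobs into `out` only while the next serialized job still fits into the
  // batch, so the activation is truncated instead of failing once the limit is hit.
  static <J> int collectUpTo(
      final Iterable<J> jobs,
      final int maxJobsToActivate,
      final BatchWriterProbe writer,
      final ToIntFunction<J> estimatedSizeInBytes,
      final List<J> out) {
    for (final J job : jobs) {
      if (out.size() >= maxJobsToActivate
          || !writer.canWriteAdditionalEvent(estimatedSizeInBytes.applyAsInt(job))) {
        break; // truncate: return what fits rather than rejecting the whole batch
      }
      out.add(job);
    }
    return out.size();
  }
}
```

The actual change (see #8491 / #8799) works against the real writer and record types rather than these placeholder interfaces.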

NOTE: the tests added in the dispatcher are not very good, but I couldn't come up with something else that wouldn't be too coupled to the implementation (i.e. essentially reusing `LogBufferAppender`). I would like some ideas/suggestions.

NOTE: this PR comes out of the larger one, #8491. You can check that one out to see how the new API would be used, e.g. in the `JobBatchCollector`. As such, this is marked for backporting, since we'll backport the complete fix for #5525.

## Related issues

related to #5525 



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
ghost pushed a commit that referenced this issue Feb 17, 2022
8809: [Backport stable/1.2] Add API to probe the logstream batch writer if more bytes can be written without writing them r=npepinpe a=github-actions[bot]

# Description
Backport of #8798 to `stable/1.2`.

relates to #5525

Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
ghost pushed a commit that referenced this issue Feb 17, 2022
8810: [Backport stable/1.3] Add API to probe the logstream batch writer if more bytes can be written without writing them r=npepinpe a=github-actions[bot]

# Description
Backport of #8798 to `stable/1.3`.

relates to #5525

Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
ghost pushed a commit that referenced this issue Feb 17, 2022
8797: Extend Either/EitherAssert capabilities r=npepinpe a=npepinpe

## Description

This PR extends `Either` by adding a new API, `Either#getOrElse(R)`. This allows extracting the right value of the `Either` or returning a fallback. I did not add any tests as the implementation is incredibly simple, and I can't foresee it ever getting more complex, but do challenge this.
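
For illustration, usage would look roughly like this (a hedged sketch assuming `Either.left`/`Either.right` factory methods as in the existing utility; the values are made up):

```java
final Either<String, Integer> parsed = Either.right(42);
final Either<String, Integer> failed = Either.left("not a number");

assert parsed.getOrElse(0) == 42; // right value is returned as-is
assert failed.getOrElse(0) == 0;  // left falls back to the provided default
```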

It also extends the related `EitherAssert` by adding new `left` and `right` extraction capabilities. So you can now assert something like:

```java
EitherAssert.assertThat(either).left().isEqualTo(1);
EitherAssert.assertThat(instantEither)
	.right()
	.asInstanceOf(InstanceOfAssertFactories.INSTANT)
	.isBetween(today, tomorrow);
```

Note that calling `EitherAssert#right()` will, under the hood, still call `EitherAssert#isRight()`.

This PR is related to #5525 and is extracted from the bigger spike in #8491. You can review how it's used there, specifically in the `JobBatchCollectorTest`. As such, this is marked for backporting, since we'll backport the complete fix for #5525.

## Related issues

related to #5525 



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
Co-authored-by: Nicolas Pepin-Perreault <43373+npepinpe@users.noreply.github.com>
ghost pushed a commit that referenced this issue Feb 21, 2022
8815: [Backport stable/1.2] Extend Either/EitherAssert capabilities r=npepinpe a=npepinpe

## Description

This PR backports #8797 to stable/1.2, which is necessary in order to eventually backport the fix for #5525. There were some conflicts, which I corrected with the last commit (one extra dependency pulled in).

## Related issues

backports #8797 



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
Co-authored-by: Nicolas Pepin-Perreault <43373+npepinpe@users.noreply.github.com>
ghost pushed a commit that referenced this issue Feb 21, 2022
8814: [Backport stable/1.3] Extend Either/EitherAssert capabilities r=npepinpe a=npepinpe

## Description

This PR backports #8797 to stable/1.3, which is necessary in order to eventually backport the fix for #5525. There were some conflicts, which I corrected with the last commit (one extra dependency pulled in).

## Related issues

backports #8797 



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
Co-authored-by: Nicolas Pepin-Perreault <43373+npepinpe@users.noreply.github.com>
ghost closed this as completed in bd35278 on Feb 23, 2022
Zeebe automation moved this from Review in progress to Done Feb 23, 2022
ghost pushed a commit that referenced this issue Feb 23, 2022
8832: [Backport stable/1.3] Correctly truncate a job activation batch if it will not fit in the dispatcher r=npepinpe a=github-actions[bot]

# Description
Backport of #8799 to `stable/1.3`.

relates to #8797 #5525

Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
ghost pushed a commit that referenced this issue Feb 23, 2022
8831: [Backport stable/1.2] Correctly truncate a job activation batch if it will not fit in the dispatcher r=npepinpe a=github-actions[bot]

# Description
Backport of #8799 to `stable/1.2`.

relates to #8797 #5525

Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
Zelldon added the Release: 1.2.11 and version:1.3.5 labels on Mar 1, 2022
KerstinHebel removed this from Done in Zeebe on Mar 23, 2022
This issue was closed.