Too big deployment can cause problems on distribution #5776

Closed
Zelldon opened this issue Nov 6, 2020 · 3 comments
Labels
area/reliability: Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected)
kind/bug: Categorizes an issue or PR as a bug
scope/broker: Marks an issue or PR to appear in the broker section of the changelog
severity/mid: Marks a bug as having a noticeable impact but with a known workaround

Comments

@Zelldon
Member

Zelldon commented Nov 6, 2020

Describe the bug

If a deployment comes close to the maxMessageSize, distribution can fail: the CREATE command that is distributed to the other partitions contains more content than the initial CREATE command. The result is a deployed workflow that can be started on partition one, but not on the other partitions.
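To make the failure mode concrete, here is a minimal arithmetic sketch. All byte counts below are made up for illustration; only the 4 MiB limit matches the "can't claim more than 4194304 bytes" figure in the rejection message quoted later in the thread.

```java
// Illustrative only: the concrete byte counts are assumptions, not measured
// values. The point is that the distributed CREATE carries more content than
// the initial CREATE, so a deployment that fits on partition 1 can exceed the
// limit when it is forwarded to the other partitions.
class DeploymentDistributionSize {
  // mirrors the "can't claim more than 4194304 bytes" from the error below
  static final int MAX_CLAIM = 4 * 1024 * 1024;

  public static void main(String[] args) {
    int initialCreate = 4_000_000;      // hypothetical: raw resources, just fits
    int distributionOverhead = 400_000; // hypothetical: extra content added for distribution
    int distributedCreate = initialCreate + distributionOverhead;

    System.out.println(initialCreate <= MAX_CLAIM);     // true  -> accepted on partition 1
    System.out.println(distributedCreate <= MAX_CLAIM); // false -> fails on the other partitions
  }
}
```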

To Reproduce

```java
// Imports reconstructed for readability; the Zeebe test-harness packages
// (ClusteringRule, GrpcClientRule) are assumptions based on the qa module layout.
import static org.assertj.core.api.Assertions.assertThat;

import io.zeebe.broker.it.clustering.ClusteringRule;
import io.zeebe.broker.it.util.GrpcClientRule;
import io.zeebe.model.bpmn.Bpmn;
import io.zeebe.model.bpmn.BpmnModelInstance;
import io.zeebe.protocol.record.Record;
import io.zeebe.protocol.record.intent.WorkflowInstanceIntent;
import io.zeebe.test.util.record.RecordingExporter;
import java.util.stream.Collectors;
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.RuleChain;
import org.junit.rules.Timeout;

public final class DeploymentClusteredTest {

  private static final BpmnModelInstance WORKFLOW =
      Bpmn.createExecutableProcess("process").startEvent().endEvent().done();

  public final Timeout testTimeout = Timeout.seconds(120);
  public final ClusteringRule clusteringRule =
      new ClusteringRule(3, 3, 3, cfg -> cfg.getData().setUseMmap(false));
  public final GrpcClientRule clientRule = new GrpcClientRule(clusteringRule);

  @Rule
  public RuleChain ruleChain =
      RuleChain.outerRule(testTimeout).around(clusteringRule).around(clientRule);

  @Test
  public void shouldDeployWorkflowAndCreateInstances() {
    // when
    final var workflowKey =
        clientRule.deployWorkflow(
            Bpmn.readModelFromStream(
                this.getClass().getResourceAsStream("/workflows/bigone-task-process.bpmn")));

    final var workflowInstanceKeys =
        clusteringRule.getPartitionIds().stream()
            .map(
                partitionId ->
                    clusteringRule.createWorkflowInstanceOnPartition(partitionId, "process"))
            .collect(Collectors.toList());

    // then
    assertThat(
            RecordingExporter.workflowInstanceRecords(WorkflowInstanceIntent.ELEMENT_COMPLETED)
                .filterRootScope()
                .withWorkflowKey(workflowKey)
                .limit(clusteringRule.getPartitionCount()))
        .extracting(Record::getKey)
        .containsExactlyInAnyOrderElementsOf(workflowInstanceKeys);
  }
}
```

model.zip

Expected behavior

That the deployment is rejected, maybe?

Environment:

  • OS: arch
  • Zeebe Version: snapshot
  • Configuration: [e.g. exporters etc.]
@Zelldon Zelldon added kind/bug Categorizes an issue or PR as a bug scope/broker Marks an issue or PR to appear in the broker section of the changelog Impact: Availability severity/mid Marks a bug as having a noticeable impact but with a known workaround labels Nov 6, 2020
@npepinpe
Member

npepinpe commented Nov 9, 2020

What does the user/client see in this case? Is this something users can "fix"?

@npepinpe
Member

It looks like this is the error the user gets:

Command 'CREATE' rejected with code 'PROCESSING_ERROR': Expected to process event 'TypedEventImpl{metadata=RecordMetadata{recordType=COMMAND, intentValue=255, intent=CREATE, requestStreamId=1, requestId=0, protocolVersion=3, valueType=DEPLOYMENT, rejectionType=NULL_VAL, rejectionReason=, brokerVersion=1.2.0}, value={"resources":[{"resourceName":"process.bpmn","resource":"PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiIHN0YW5kYWxvbmU9Im5vIj8+CjxicG1uOmRlZmluaXRpb25zIHhtbG5zOmJwbW49Imh0dHA6Ly93d3cub21nLm9yZy9zcGVjL0JQTU4vMjAxMDA1MjQvTU9ERUwiIHhtbG5zOmJwbW5kaT0iaHR0cDovL3d3dy5vbWcub3JnL3NwZWMvQlBNTi8yMDEwMDUyNC9ESSIgeG1sbnM6ZGM9Imh0dHA6Ly93d3cub21nLm9yZy9zcGVjL0RELzIwMTAwNTI0L0RDIiB4bWxuczpkaT0iaHR0cDovL3d3dy5vbWcub3JnL3NwZWMvREQvMjAxMDA1MjQvREkiIHhtbG5zOnhzaT0iaHR0cDovL3d3dy53My5vcmcvMjAwMS9YTUxTY2hlbWEtaW5zdGFuY2UiIHhtbG5zOnplZWJlPSJodHRwOi8vY2FtdW5kYS5vcmcvc2NoZW1hL3plZWJlLzEuMCIgZXhwb3J0ZXI9IkNhbXVuZGEgTW9kZWxlciIgZXhwb3J0ZXJWZXJzaW9uPSIxLjguMiIgZXhwcmVzc2lvbkxhbmd1YWdlPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L1hQYXRoIiBpZD0iRGVmaW5pdGlvbnNfMSIgdGFyZ2V0TmFtZXNwYWNlPSJodHRwOi8vYnBtbi5pby9zY2hlbWEvYnBtbiIgdHlwZUxhbmd1YWdlPSJodHRwOi8vd3d3LnczLm9yZy8yMDAxL1hNTFNjaGVtYSI+CiAgICAKICA8YnBtbjpwcm9jZXNzIGlkPSJwcm9jZXNzIiBpc0Nsb3NlZD0iZmFsc2UiIGlzRXhlY3V0YWJsZT0idHJ1ZSIgcHJvY2Vzc1R5cGU9Ik5vbmUiPgogICAgICAgIAogICAgPGJwbW46c3RhcnRFdmVudCB...}' without errors, but exception occurred with message 'Expected to claim segment of size 8358688, but can't claim more than 4194304 bytes.'.

Which somewhat hints that it's a size issue, but I don't think it's very clear to the user what they have to do. I would argue this isn't deployment-specific: any time we fail to grab a segment on the dispatcher due to size, this error will be returned. I imagine this can happen while processing other commands as well (even if it's less likely).

In this specific case, we could solve it by checking the size before even writing to the dispatcher; since the record is much bigger than a segment, it's obvious the claim will fail. If it instead happens later, during enrichment, the failure is internal, which is more of a problem, and I'm not sure how it would end up being reported. IMO this falls under the broader topic of dealing with maximum message sizes, and I'd like to tackle that at a higher level. For this particular issue, I'm not sure how we could improve the report to the user, since this is generic error handling.
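A pre-check along the lines suggested above could look roughly like this. This is a hypothetical sketch, not the actual broker code: the class name, method, and rejection message are invented, and only the 4 MiB segment size and the 8,358,688-byte record length come from the error quoted earlier.

```java
// Hypothetical sketch of the suggested pre-check: reject a record that can
// never fit in a dispatcher segment before trying to claim one, so the user
// gets a targeted rejection instead of a generic PROCESSING_ERROR.
class SegmentSizeGuard {
  private final int maxFragmentLength;

  SegmentSizeGuard(int maxFragmentLength) {
    this.maxFragmentLength = maxFragmentLength;
  }

  // returns a rejection reason, or null if the record fits
  String check(int recordLength) {
    if (recordLength > maxFragmentLength) {
      return "Expected to write a record of size " + recordLength
          + " bytes, but the maximum is " + maxFragmentLength
          + " bytes; deploy a smaller resource or raise the message size limit.";
    }
    return null;
  }

  public static void main(String[] args) {
    var guard = new SegmentSizeGuard(4 * 1024 * 1024);
    System.out.println(guard.check(1_024) == null);     // true: small record passes
    System.out.println(guard.check(8_358_688) == null); // false: the failing size from the error above
  }
}
```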

@npepinpe npepinpe added area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) and removed Impact: Availability labels Apr 11, 2022
@Zelldon
Member Author

Zelldon commented Aug 1, 2022

This no longer seems to be the case: @korthout tried to reproduce the issue while preparing a game day. I will close this for now.

@Zelldon Zelldon closed this as completed Aug 1, 2022
github-merge-queue bot pushed a commit that referenced this issue Mar 14, 2024