12402: Aggregate equivalent client streams r=deepthidevaki a=deepthidevaki

## Description

Aggregates client streams with the same `streamType` and `metadata` into a single stream that is registered with the server. Payloads pushed to this server stream are distributed to one of the registered client streams. Currently, the client stream is picked randomly. Later, we can employ a better strategy to choose the client stream that receives the payload.

The streamId for the aggregated stream is generated randomly. When all existing client streams are removed, the corresponding server stream is also removed. When a new client stream is later registered with the same streamType and metadata, a new aggregated stream is created with a new streamId. This is ok, because as long as there is an aggregated stream in the registry, new client streams are added to it. Not re-using the previous streamId also helps to prevent edge cases where concurrent remove (of the old aggregated stream) and add (of the new aggregated stream) requests caused by retries result in an inconsistent state.
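The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `AggregatedStreamRegistry`, `Key`, `AggregatedStream` and all method names are hypothetical, and client streams are represented by plain string ids.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Random;
import java.util.UUID;

// Hypothetical sketch; names and structure are assumptions, not the Zeebe implementation.
final class AggregatedStreamRegistry {

  // Client streams are equivalent when both streamType and metadata match.
  static final class Key {
    final String streamType;
    final String metadata;

    Key(String streamType, String metadata) {
      this.streamType = streamType;
      this.metadata = metadata;
    }

    @Override
    public boolean equals(Object other) {
      if (!(other instanceof Key)) {
        return false;
      }
      Key that = (Key) other;
      return streamType.equals(that.streamType) && metadata.equals(that.metadata);
    }

    @Override
    public int hashCode() {
      return Objects.hash(streamType, metadata);
    }
  }

  // One server-side stream with a random id, backed by one or more client streams.
  static final class AggregatedStream {
    final UUID streamId = UUID.randomUUID();
    final List<String> clientStreams = new ArrayList<>();
  }

  private final Map<Key, AggregatedStream> streams = new HashMap<>();
  private final Random random = new Random();

  // Adds the client stream and returns the id of the server stream it belongs to.
  UUID add(String streamType, String metadata, String clientStreamId) {
    AggregatedStream stream =
        streams.computeIfAbsent(new Key(streamType, metadata), k -> new AggregatedStream());
    stream.clientStreams.add(clientStreamId);
    return stream.streamId;
  }

  // Removes the client stream; the aggregated stream is dropped once it is empty.
  void remove(String streamType, String metadata, String clientStreamId) {
    Key key = new Key(streamType, metadata);
    AggregatedStream stream = streams.get(key);
    if (stream == null) {
      return;
    }
    stream.clientStreams.remove(clientStreamId);
    if (stream.clientStreams.isEmpty()) {
      streams.remove(key);
    }
  }

  // Picks a random client stream to receive a payload, or null if none is registered.
  String pickClient(String streamType, String metadata) {
    AggregatedStream stream = streams.get(new Key(streamType, metadata));
    if (stream == null) {
      return null;
    }
    return stream.clientStreams.get(random.nextInt(stream.clientStreams.size()));
  }

  int serverStreamCount() {
    return streams.size();
  }
}
```

In this sketch, re-adding a client stream after the last one was removed goes through `computeIfAbsent` again and therefore gets a fresh random `streamId`, which mirrors the behaviour described above.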

## Related issues

closes #12253 



12406: ci(integration): split module and integration test jobs r=megglos a=megglos

## Description
Reduces the overall runtime. The module tests and qa-integration tests cannot run in parallel within one job due to the Maven module inter-dependencies, so extracting the module ITs into a dedicated job gets the overall IT stage down to ~10 minutes, whereas on main the `Integration tests` job, which combines module and QA integration tests, shows runtimes of about 15m.

While at it, I introduced a shared IT job setup that can be configured through a matrix.
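To illustrate how the matrix values feed into the shared job, here is a rough sketch that assembles an abbreviated version of the Maven invocation from one matrix entry. `ItMavenCommand` and its parameter names are hypothetical, and several flags from the workflow are omitted for brevity. The shell quoting used in the workflow for the `!`-prefixed module list is not needed when arguments are passed as a list.

```java
import java.util.List;

// Hypothetical helper; mirrors a subset of the flags used in the shared IT job.
final class ItMavenCommand {

  // Builds the argument list for one matrix entry (modules, build threads, fork count).
  static List<String> args(String mavenModules, int buildThreads, int forkCount) {
    return List.of(
        "./mvnw",
        "-B",
        "-T",
        String.valueOf(buildThreads),
        "--no-snapshot-updates",
        "-D",
        "forkCount=" + forkCount,
        "-D",
        "skipUTs",
        "-D",
        "skipChecks",
        "-pl",
        mavenModules,
        "verify");
  }
}
```

For example, `ItMavenCommand.args("qa/update-tests", 1, 10)` corresponds to the `qa-update` matrix entry, while `ItMavenCommand.args("!qa/integration-tests,!qa/update-tests", 2, 7)` corresponds to the `modules` entry.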

While looking at the unit test summary, I wondered why the s3 unit tests take about 2m to complete, and found that some ITs were actually run as unit tests. I made sure they are run as ITs going forward [by renaming them](3b5781f).
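The rename works because tests are selected by class name. As a rough sketch, the check below mirrors the three include patterns configured for the failsafe plugin in `atomix/pom.xml` in this commit (`**/IT*.java`, `**/*IT.java`, `**/*ITCase.java`). `TestNaming` is a hypothetical helper for illustration only; the real matching is done by Maven on file paths.

```java
// Hypothetical helper that approximates the failsafe include patterns by class name.
final class TestNaming {

  // True if the simple class name matches IT*, *IT or *ITCase.
  static boolean isIntegrationTest(String simpleClassName) {
    return simpleClassName.startsWith("IT")
        || simpleClassName.endsWith("IT")
        || simpleClassName.endsWith("ITCase");
  }
}
```

Under this check, `RaftFailOverTest` is not picked up as an IT, whereas the renamed `RaftFailOverIT` is.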

Overall, the CI run duration is no longer dominated by the integration test job but by multiple jobs that each take around 10m.

relates to #12028 


12421: feat: default to a better raft request timeout r=oleschoenburg a=oleschoenburg

Using the old default values of:

```yaml
zeebe.broker:
  cluster:
    electionTimeout: 2.5s
  raft:
    enablePriorityElection: true
  experimental:
    maxAppendsPerFollower: 2
    raft:
      requestTimeout: 5s
```

the loss of 2 requests between the primary (leader) and the secondary (follower) could trigger an unnecessary re-election, because the secondary would not receive any requests from the primary for at least 5 seconds, which exceeds the election timeout.

This changes the default request timeout to always match the default election timeout. Using all default values, we get at least one more request attempt between primary and secondary before re-election and probably more, depending on the exact timing when requests are sent.
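A simplified way to see the effect, assuming a lost request is only noticed once its request timeout expires and ignoring the exact timing of in-flight requests: count how many whole request timeouts fit into one election timeout. `RaftTimeouts` and `timeoutsWithinElection` are hypothetical names used only for this sketch.

```java
import java.time.Duration;

// Hypothetical helper; a deliberately simplified model of the timing argument.
final class RaftTimeouts {

  // Number of whole request timeouts that fit into one election timeout.
  // Each of them is a point at which the leader can notice a lost request and retry.
  static long timeoutsWithinElection(Duration requestTimeout, Duration electionTimeout) {
    return electionTimeout.dividedBy(requestTimeout);
  }
}
```

With the old defaults, `timeoutsWithinElection(Duration.ofSeconds(5), Duration.ofMillis(2500))` is 0, so no retry can start before the election timeout. With the new default it is 1, which matches the "at least one more request attempt" above.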

closes #12009 


12426: Disable test results comment r=npepinpe a=remcowesterhoud

## Description

<!-- Please explain the changes you made here. -->
Disables the test results comment that gets added to PRs. As a team it was decided this was not useful.

## Related issues

<!-- Which issues are closed by this PR or are related -->

closes #



12428: test(qa): save logs of zeebe containers if the test fails r=deepthidevaki a=deepthidevaki

## Description

There were no logs from the brokers or gateway, so it was not possible to debug the flaky test #12396.
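The idea of dumping container logs only when a test fails can be sketched without any test framework as follows. This is not the `ContainerLogsDumper` used in this PR, which is registered as a JUnit extension; `LogsOnFailure` and its parameters are hypothetical, and container logs are modelled as a plain map from container name to log text.

```java
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of "save logs only if the test fails".
final class LogsOnFailure {

  // Runs the test body; on failure, writes each container's logs to out and rethrows.
  static void run(
      Runnable testBody, Supplier<Map<String, String>> logsByContainer, StringBuilder out) {
    try {
      testBody.run();
    } catch (RuntimeException | Error e) {
      // Logs are fetched lazily, so passing tests pay no cost for collecting them.
      logsByContainer
          .get()
          .forEach(
              (name, logs) ->
                  out.append("==== ").append(name).append(" ====\n").append(logs).append('\n'));
      throw e;
    }
  }
}
```

Passing the logs as a supplier is similar in spirit to handing the dumper a method reference such as `cluster::getNodes`: the containers are only queried when a failure actually occurs.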


Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Co-authored-by: Meggle (Sebastian Bathke) <sebastian.bathke@camunda.com>
Co-authored-by: Sebastian Bathke (Meggle) <sebastian.bathke@camunda.com>
Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
5 people committed Apr 14, 2023
6 parents fd19810 + e17106b + 79e95f1 + 25a334d + 32ad540 + 6c13082 commit ab489d3
Showing 20 changed files with 416 additions and 210 deletions.
107 changes: 35 additions & 72 deletions .github/workflows/ci.yml
@@ -24,74 +24,38 @@ env:

jobs:
integration-tests:
name: Integration tests
runs-on: [ self-hosted, linux, amd64, "16" ]
timeout-minutes: 45
env:
TC_CLOUD_LOGS_VERBOSE: true
TC_CLOUD_TOKEN: ${{ secrets.TC_CLOUD_TOKEN }}
TC_CLOUD_CONCURRENCY: 2
ZEEBE_TEST_DOCKER_IMAGE: localhost:5000/camunda/zeebe:current-test
services:
registry:
image: registry:2
ports:
- 5000:5000
steps:
- uses: actions/checkout@v3
- uses: ./.github/actions/setup-zeebe
with:
maven-cache: 'true'
secret_vault_secretId: ${{ secrets.VAULT_SECRET_ID }}
secret_vault_address: ${{ secrets.VAULT_ADDR }}
secret_vault_roleId: ${{ secrets.VAULT_ROLE_ID }}
- uses: ./.github/actions/build-zeebe
id: build-zeebe
with:
maven-extra-args: -T1C
- uses: ./.github/actions/build-docker
with:
repository: localhost:5000/camunda/zeebe
version: current-test
push: true
distball: ${{ steps.build-zeebe.outputs.distball }}
- name: Prepare Testcontainers Cloud agent
if: env.TC_CLOUD_TOKEN != ''
run: |
curl -L -o agent https://app.testcontainers.cloud/download/testcontainers-cloud-agent_linux_x86-64
chmod +x agent
./agent --private-registry-url=http://localhost:5000 '--private-registry-allowed-image-name-globs=*,*/*' > .testcontainers-agent.log 2>&1 &
./agent wait
- name: Create build output log file
run: echo "BUILD_OUTPUT_FILE_PATH=$(mktemp)" >> $GITHUB_ENV
- name: Maven Test Build
run: >
./mvnw -B -T2 --no-snapshot-updates
-D forkCount=5
-D maven.javadoc.skip=true
-D skipUTs -D skipChecks
-D failsafe.rerunFailingTestsCount=3 -D flaky.test.reportDir=failsafe-reports
-P parallel-tests,extract-flaky-tests
-pl '!qa/update-tests'
verify
| tee "${BUILD_OUTPUT_FILE_PATH}"
- name: Duplicate Test Check
uses: ./.github/actions/check-duplicate-tests
with:
buildOutputFilePath: ${{ env.BUILD_OUTPUT_FILE_PATH }}
- name: Upload test artifacts
uses: ./.github/actions/collect-test-artifacts
if: always()
with:
name: Integration Tests
qa-update-tests:
name: QA Update tests
name: "[IT] ${{ matrix.name }}"
timeout-minutes: 20
runs-on: [ self-hosted, linux, amd64, "16" ]
timeout-minutes: 45
strategy:
fail-fast: false
matrix:
group: [ modules, qa-integration, qa-update ]
include:
- group: modules
name: "Module Integration Tests"
maven-modules: "'!qa/integration-tests,!qa/update-tests'"
maven-build-threads: 2
maven-test-fork-count: 7
tcc-enabled: 'false'
- group: qa-integration
name: "QA Integration Tests"
maven-modules: "qa/integration-tests"
maven-build-threads: 1
maven-test-fork-count: 10
tcc-enabled: 'true'
tcc-concurrency: 2
- group: qa-update
name: "QA Update Tests"
maven-modules: "qa/update-tests"
maven-build-threads: 1
maven-test-fork-count: 10
tcc-enabled: 'true'
tcc-concurrency: 2
env:
TC_CLOUD_LOGS_VERBOSE: true
TC_CLOUD_TOKEN: ${{ secrets.TC_CLOUD_TOKEN }}
TC_CLOUD_CONCURRENCY: 2
TC_CLOUD_TOKEN: ${{ matrix.tcc-enabled == 'true' && secrets.TC_CLOUD_TOKEN || '' }}
TC_CLOUD_CONCURRENCY: ${{ matrix.tcc-concurrency }}
ZEEBE_TEST_DOCKER_IMAGE: localhost:5000/camunda/zeebe:current-test
services:
registry:
@@ -127,12 +91,13 @@ jobs:
run: echo "BUILD_OUTPUT_FILE_PATH=$(mktemp)" >> $GITHUB_ENV
- name: Maven Test Build
run: >
./mvnw -B -T2 --no-snapshot-updates
./mvnw -B -T ${{ matrix.maven-build-threads }} --no-snapshot-updates
-D forkCount=${{ matrix.maven-test-fork-count }}
-D maven.javadoc.skip=true
-D skipUTs -D skipChecks
-D failsafe.rerunFailingTestsCount=3 -D flaky.test.reportDir=failsafe-reports
-P parallel-tests,extract-flaky-tests
-pl qa/update-tests
-pl ${{ matrix.maven-modules }}
verify
| tee "${BUILD_OUTPUT_FILE_PATH}"
- name: Duplicate Test Check
@@ -143,7 +108,7 @@ jobs:
uses: ./.github/actions/collect-test-artifacts
if: always()
with:
name: QA Update Tests
name: "[IT] ${{ matrix.name }}"
unit-tests:
name: Unit tests
runs-on: [ self-hosted, linux, amd64, "16" ]
@@ -185,7 +150,7 @@ jobs:
with:
name: "unit tests"
smoke-tests:
name: Smoke tests on ${{ matrix.os }} with ${{ matrix.arch }}
name: "[Smoke] ${{ matrix.os }} with ${{ matrix.arch }}"
timeout-minutes: 20
runs-on: ${{ matrix.runner }}
strategy:
@@ -243,7 +208,7 @@ jobs:
uses: ./.github/actions/collect-test-artifacts
if: always()
with:
name: Smoke Tests on ${{ matrix.os }} with ${{ matrix.arch }}
name: "[Smoke] ${{ matrix.os }} with ${{ matrix.arch }}"
property-tests:
name: Property Tests
runs-on: [ self-hosted, linux, amd64, "16" ]
@@ -489,7 +454,6 @@ jobs:
runs-on: ubuntu-latest
needs:
- integration-tests
- qa-update-tests
- unit-tests
- smoke-tests
- property-tests
@@ -509,7 +473,6 @@ jobs:
runs-on: ubuntu-latest
needs:
- integration-tests
- qa-update-tests
- unit-tests
- smoke-tests
- property-tests
1 change: 1 addition & 0 deletions .github/workflows/publish-test-results.yml
@@ -44,3 +44,4 @@ jobs:
junit_files: |
artifacts/**/surefire-reports/*.xml
artifacts/**/failsafe-reports/TEST-*.xml
comment_mode: off
@@ -29,7 +29,7 @@
import org.junit.runners.Parameterized.Parameters;

@RunWith(Parameterized.class)
public class RaftFailOverTest {
public class RaftFailOverIT {

@Rule @Parameter public RaftRule raftRule;

@@ -52,7 +52,7 @@
import org.junit.runners.Parameterized.Parameters;

@RunWith(Parameterized.class)
public class ZeebeTest {
public class ZeebeIT {

// rough estimate of how many entries we'd need to write to fill a segment
// segments are configured for 1kb, and one entry takes ~30 bytes (plus some metadata I guess)
12 changes: 12 additions & 0 deletions atomix/pom.xml
@@ -73,6 +73,18 @@
<suppressionsLocation>src/main/resources/suppression.xml</suppressionsLocation>
</configuration>
</plugin>

<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-failsafe-plugin</artifactId>
<configuration>
<includes>
<include>**/IT*.java</include>
<include>**/*IT.java</include>
<include>**/*ITCase.java</include>
</includes>
</configuration>
</plugin>
</plugins>
</build>
</project>
@@ -40,7 +40,7 @@
import software.amazon.awssdk.services.s3.model.CreateBucketRequest;

@Testcontainers
final class CompressionTest {
final class CompressionIT {
private static final String ACCESS_KEY = "letmein";
private static final String SECRET_KEY = "letmein1234";
private static final int DEFAULT_PORT = 9000;
@@ -18,10 +18,8 @@
import org.apache.commons.lang3.RandomStringUtils;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.testcontainers.junit.jupiter.Testcontainers;
import software.amazon.awssdk.regions.Region;

@Testcontainers
final class ConnectionErrorTest {
private static final String ACCESS_KEY = "letmein";
private static final String SECRET_KEY = "letmein1234";
@@ -29,7 +29,7 @@
import software.amazon.awssdk.services.s3.model.CreateBucketRequest;

@Testcontainers
final class CustomBasePathTest {
final class CustomBasePathIT {
private static final String ACCESS_KEY = "letmein";
private static final String SECRET_KEY = "letmein1234";
private static final int DEFAULT_PORT = 9000;
@@ -26,6 +26,8 @@ public final class ClusterCfg implements ConfigurationEntry {
public static final int DEFAULT_REPLICATION_FACTOR = 1;
public static final int DEFAULT_CLUSTER_SIZE = 1;
public static final String DEFAULT_CLUSTER_NAME = "zeebe-cluster";
public static final Duration DEFAULT_ELECTION_TIMEOUT = Duration.ofMillis(2500);

private static final String NODE_ID_ERROR_MSG =
"Node id %s needs to be non negative and smaller then cluster size %s.";
private static final String REPLICATION_FACTOR_ERROR_MSG =
@@ -38,7 +40,6 @@ public final class ClusterCfg implements ConfigurationEntry {
+ " quorum = {}. If you want to ensure high fault-tolerance and availability,"
+ " make sure to use an odd replication factor.";
private static final Duration DEFAULT_HEARTBEAT_INTERVAL = Duration.ofMillis(250);
private static final Duration DEFAULT_ELECTION_TIMEOUT = Duration.ofMillis(2500);

private List<String> initialContactPoints = DEFAULT_CONTACT_POINTS;

@@ -7,11 +7,15 @@
*/
package io.camunda.zeebe.broker.system.configuration;

import static io.camunda.zeebe.broker.system.configuration.ClusterCfg.DEFAULT_ELECTION_TIMEOUT;

import java.time.Duration;

public final class ExperimentalRaftCfg implements ConfigurationEntry {

private static final Duration DEFAULT_REQUEST_TIMEOUT = Duration.ofSeconds(5);
// Requests should time out faster than the election timeout to ensure that a single missed
// heartbeat does not cause immediate re-election.
private static final Duration DEFAULT_REQUEST_TIMEOUT = DEFAULT_ELECTION_TIMEOUT;
private static final Duration DEFAULT_MAX_QUORUM_RESPONSE_TIMEOUT = Duration.ofSeconds(0);
private static final int DEFAULT_MIN_STEP_DOWN_FAILURE_COUNT = 3;
private static final int DEFAULT_PREFER_SNAPSHOT_REPLICATION_THRESHOLD = 100;
2 changes: 1 addition & 1 deletion dist/src/main/config/broker.standalone.yaml.template
@@ -896,7 +896,7 @@
# raft:
# Sets the timeout for all requests send by raft leaders and followers.
# This setting can also be overridden using the environment variable ZEEBE_BROKER_EXPERIMENTAL_RAFT_REQUESTTIMEOUT
# requestTimeout: 5s
# requestTimeout: 2500ms

# If the leader is not able to reach the quorum, the leader may step down.
# This is triggered after a number of requests, to a quorum of followers, has failed, and the number of failures
2 changes: 1 addition & 1 deletion dist/src/main/config/broker.yaml.template
@@ -806,7 +806,7 @@
# raft:
# Sets the timeout for all requests send by raft leaders and followers.
# This setting can also be overridden using the environment variable ZEEBE_BROKER_EXPERIMENTAL_RAFT_REQUESTTIMEOUT
# requestTimeout: 5s
# requestTimeout: 2500ms

# If the leader is not able to reach the quorum, the leader may step down.
# This is triggered after a number of requests, to a quorum of followers, has failed, and the number of failures
@@ -10,13 +10,15 @@
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.PartitionInfo;
import io.camunda.zeebe.qa.util.actuator.RebalanceActuator;
import io.camunda.zeebe.qa.util.testcontainers.ContainerLogsDumper;
import io.camunda.zeebe.qa.util.testcontainers.ZeebeTestContainerDefaults;
import io.zeebe.containers.ZeebeGatewayNode;
import io.zeebe.containers.cluster.ZeebeCluster;
import java.time.Duration;
import org.awaitility.Awaitility;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.RegisterExtension;
import org.testcontainers.containers.Network;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;
@@ -36,6 +38,10 @@ final class RebalancingEndpointIT {
.withNetwork(network)
.build();

@RegisterExtension
@SuppressWarnings("unused")
final ContainerLogsDumper logsWatcher = new ContainerLogsDumper(cluster::getNodes);

private ZeebeClient client;

@BeforeEach