Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-14498: reduce the startup nodes to avoid timeout error #13016

Merged
merged 2 commits into from Dec 21, 2022

Conversation

showuon
Copy link
Contributor

@showuon showuon commented Dec 19, 2022

In MetadataQuorumCommandTest, we sometimes got the error:

java.util.concurrent.ExecutionException: java.lang.RuntimeException: Received a fatal error while waiting for the broker to catch up with the current cluster metadata.

Since we tried to bring up 3 broker + 3 controllers at the same time, and the config initial.broker.registration.timeout.ms (default 1 min) is sometimes not enough for them to start up. Checking the tests, it doesn't require so many nodes. Reducing the nodes number to make these tests reliable.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@showuon
Copy link
Contributor Author

showuon commented Dec 19, 2022

@dengziming , please take a look. Thanks.

@ClusterTest(clusterType = Type.CO_KRAFT, brokers = 2, controllers = 3),
@ClusterTest(clusterType = Type.KRAFT, brokers = 2, controllers = 3)
@ClusterTest(clusterType = Type.CO_KRAFT, brokers = 1, controllers = 1),
@ClusterTest(clusterType = Type.KRAFT, brokers = 1, controllers = 1),
Copy link
Contributor

@ijuma ijuma Dec 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test name indicates its testing replication, that doesn't happen if there's a single broker & controller, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Let me use 2 + 2 nodes instead

@showuon
Copy link
Contributor Author

showuon commented Dec 20, 2022

@ijuma , PR updated. I changed to 2 brokers + 2 controllers for replication test and status test, because we already have a testOnlyOneBrokerAndOneController to test 1 + 1 node case. Here, we should test multiple nodes case.

Copy link
Member

@dengziming dengziming left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wish this can fix the flaky tests.

Copy link
Contributor

@ijuma ijuma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks.

Copy link
Contributor

@ijuma ijuma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, do we support running kraft with just 2 members in the controller quorum?

@showuon
Copy link
Contributor Author

showuon commented Dec 21, 2022

Actually, do we support running kraft with just 2 members in the controller quorum?

Yes, I don't see why we can't support that. The quorum of 2 members will be 2. It just doesn't tolerate missing any node.

@ijuma
Copy link
Contributor

ijuma commented Dec 21, 2022

I guess you're saying we don't care about the ability to tolerate failures in this particular test. Fair enough.

@showuon
Copy link
Contributor Author

showuon commented Dec 21, 2022

I guess you're saying we don't care about the ability to tolerate failures in this particular test. Fair enough.

Yes, you're right. Thanks for the comment.

@showuon
Copy link
Contributor Author

showuon commented Dec 21, 2022

Failed tests are unrelated

    Build / JDK 8 and Scala 2.12 / kafka.network.SocketServerTest.idleExpiryWithBufferedReceives()
    Build / JDK 17 and Scala 2.13 / kafka.server.DynamicBrokerReconfigurationTest.testTrustStoreAlter(String).quorum=kraft
    Build / JDK 17 and Scala 2.13 / org.apache.kafka.streams.integration.SmokeTestDriverIntegrationTest.[1] true
    Build / JDK 11 and Scala 2.13 / org.apache.kafka.clients.consumer.CooperativeStickyAssignorTest.testLargeAssignmentAndGroupWithNonEqualSubscription()
    Build / JDK 11 and Scala 2.13 / org.apache.kafka.clients.consumer.StickyAssignorTest.testLargeAssignmentAndGroupWithNonEqualSubscription()
    Build / JDK 11 and Scala 2.13 / kafka.server.DynamicBrokerReconfigurationTest.testTrustStoreAlter(String).quorum=kraft
    Build / JDK 11 and Scala 2.13 / org.apache.kafka.streams.integration.SmokeTestDriverIntegrationTest.[1] true

@showuon showuon merged commit 2575362 into apache:trunk Dec 21, 2022
@showuon
Copy link
Contributor Author

showuon commented Dec 21, 2022

@ableegoldman , this is to help make some tests reliable by decreasing the testing nodes (low risk). Do you think this should backport to 3.4?

@ableegoldman
Copy link
Contributor

@showuon yes, thanks, I'll backport it to 3.4

ableegoldman pushed a commit that referenced this pull request Dec 30, 2022
In MetadataQuorumCommandTest, we sometimes got the error:

java.util.concurrent.ExecutionException: java.lang.RuntimeException: Received a fatal error while waiting for the broker to catch up with the current cluster metadata.

Since we tried to bring up 3 broker + 3 controllers at the same time, and the config initial.broker.registration.timeout.ms (default 1 min) is sometimes not enough for them to start up. Checking the tests, it doesn't require so many nodes. Reducing the nodes number to make these tests reliable.

Reviewers: dengziming <dengziming1993@gmail.com>, Ismael Juma <ismael@juma.me.uk>
guozhangwang pushed a commit to guozhangwang/kafka that referenced this pull request Jan 25, 2023
…13016)

In MetadataQuorumCommandTest, we sometimes got the error:

java.util.concurrent.ExecutionException: java.lang.RuntimeException: Received a fatal error while waiting for the broker to catch up with the current cluster metadata.

Since we tried to bring up 3 broker + 3 controllers at the same time, and the config initial.broker.registration.timeout.ms (default 1 min) is sometimes not enough for them to start up. Checking the tests, it doesn't require so many nodes. Reducing the nodes number to make these tests reliable.

Reviewers: dengziming <dengziming1993@gmail.com>, Ismael Juma <ismael@juma.me.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants