KAFKA-15378: fix streams upgrade system test #14539
LGTM!
Triggered a system test run: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/5885/
Seems some system tests still failed... Let me look into it and see if I can reproduce locally... I did run a few locally already, which did pass... 🤔
Took a look at #13860 too, I understand the motivation of this one now. LGTM!
9ca88e3 to 3d9ea7d
Triggered a new system test build: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/5891/ Some tests seem to be flaky (cf #13860 (comment)) -- let's see what the system test result is, and make a call to merge or add more fixes...
@@ -100,7 +100,7 @@ def test_streams_runs_with_broker_down_initially(self, metadata_quorum):
         processor_3 = StreamsBrokerDownResilienceService(self.test_context, self.kafka, configs)
         processor_3.start()

-        broker_unavailable_message = "Broker may not be available"
+        broker_unavailable_message = "Node may not be available"
Log message was changed via fcac880
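To illustrate why the expected string had to follow the log message change: a test that searches the log for a fixed substring stops matching as soon as the wording changes. A minimal sketch in plain Python, not the actual system test code; the helper `log_contains` and the sample log line are made up for illustration.

```python
def log_contains(log_lines, expected_message):
    """Return True if any log line contains the expected substring."""
    return any(expected_message in line for line in log_lines)


# Hypothetical log line, invented for this example only.
sample_log = [
    "WARN (illustrative) Connection could not be established. Node may not be available.",
]

old_message = "Broker may not be available"
new_message = "Node may not be available"

# The old substring no longer matches, so a test waiting for it would time out.
print(log_contains(sample_log, old_message))
# The updated substring matches the new wording.
print(log_contains(sample_log, new_message))
```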
# -> https://issues.apache.org/jira/browse/KAFKA-14646
# thus, we cannot test two bounce rolling upgrade because we know it's broken
# instead we add version 2.4...3.3 to the `metadata_2_versions` upgrade list
#fk_join_versions = [str(LATEST_3_4)]
@mimaison You will need to uncomment this, add the 3.5 release to this list in your PR, and re-enable the corresponding @matrix annotation, too.
Noted, I'll do that once this is merged. Thanks
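As a rough illustration of the rule discussed in this thread (FK-join upgrades should only be tested from versions that contain the working fix, i.e. 3.4 and later), the sketch below filters a version list accordingly. It is plain Python and not the kafkatest code; the function names and the version strings in the usage example are assumptions chosen for illustration.

```python
def parse_version(version):
    """Turn a dotted version string such as '3.4.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


# The PR description states the FK-join upgrade bug was finally fixed in 3.4.
FK_JOIN_FIX_VERSION = (3, 4)


def fk_join_versions(available_versions):
    """Keep only versions whose major.minor is at or after the fix release."""
    return [
        v for v in available_versions
        if parse_version(v)[:2] >= FK_JOIN_FIX_VERSION
    ]


# Hypothetical list of available test versions.
print(fk_join_versions(["2.4.1", "3.3.2", "3.4.1", "3.5.1"]))
```

With only pre-3.4 versions available, the filtered list is empty, which matches the situation described in this PR where no FK-join configuration is run until 3.4 and 3.5 are added.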
The following system tests failed:
Overall, we are not in good shape :( Triggered a re-run to see what is noise: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/5896/ But I actually believe we might want to merge this PR as-is to unblock Mickael's PR, and tackle each of these tests one-by-one as follow-up work? Thoughts? @mimaison @guozhangwang
I agree, it seems it may take a while to fix all these failures, so let's merge these PRs.
Test failures: far fewer than before. A few tests (i.e., "From version") seem to be "stable" while others are not -- will keep digging.
@mjsax Can you share the TC_PATHS, ducktape options, and specs of the machines you used to run the system tests? I'm really having trouble getting any of them to pass regularly in my environment. Thanks
I did not run all of them locally yet... only the upgrade tests and the cooperative rebalancing ones. I am running them on my Mac: macOS Monterey (12.7), 2.3GHz 8-Core Intel i9, 32GB DDR4. I often modify the test Python code to run a single configuration only, and run a single test case (i.e., Python method). Otherwise I don't change anything. I also delete the docker images regularly and let them rebuild (especially when switching branches).
For example, just re-run:
Only the run for 2.6.3 failed. Looking into the test failure, the issue was that the broker did not start up in time, and the test ran into a timeout. Broker log
So I re-ran just this single configuration for 2.6.3, and it passed afterwards.
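The approach described above, narrowing a run to one configuration, can be pictured as expanding a parameter matrix and keeping a single combination. This is a minimal sketch in plain Python under stated assumptions: it is not how ducktape implements its matrix, `expand_matrix` is a hypothetical helper, and the parameter values other than `2.6.3` are placeholders.

```python
from itertools import product


def expand_matrix(**params):
    """Return one dict per combination of the given parameter value lists."""
    names = list(params)
    return [
        dict(zip(names, combo))
        for combo in product(*(params[name] for name in names))
    ]


# Placeholder values; only "2.6.3" is taken from the discussion above.
all_configs = expand_matrix(
    from_version=["2.5.1", "2.6.3", "2.7.2"],
    metadata_quorum=["quorum_a", "quorum_b"],
)

# Keep only the single configuration that should be re-run.
single_config = [
    c for c in all_configs
    if c["from_version"] == "2.6.3" and c["metadata_quorum"] == "quorum_a"
]

print(len(all_configs))
print(single_config)
```

Re-running one combination in isolation makes it easier to tell a genuine failure from noise such as a broker that started too slowly.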
Fixing bad test setup. We tried to fix an upgrade bug for FK-joins in 3.1 release, but it later turned out that the PR was not sufficient to fix it. We finally fixed in 3.4 release. This PR updates the system test matrix to only test working versions with FK-joins, limited to available test versions. Reviewers: Guozhang Wang <wangguoz@gmail.com>, Hao Li <hli@confluent.io>, Mickael Maison <mickael.maison@gmail.com>
Merged to |
Just looked into the branch builder results for Broker log shows:
And the console output shows
The Maybe we should keep observing And cooperative-rebalancing does not work with older versions, thus does not hit it, and seems to be flaky.
Fixing bad test setup. We tried to fix an upgrade bug for FK-joins in 3.1 release, but it later turned out that the PR was not sufficient to fix it. We finally fixed in 3.4 release. This PR updates the system test matrix to only test working versions with FK-joins, limited to available test versions. Reviewers: Guozhang Wang <wangguoz@gmail.com>, Hao Li <hli@confluent.io>, Mickael Maison <mickael.maison@gmail.com> Co-authored-by: Matthias J. Sax <matthias@confluent.io>
Fixing bad test setup. We tried to fix an upgrade bug for FK-joins in the 3.1 release, but it later turned out that the PR was not sufficient to fix it. We finally fixed it in the 3.4 release.
This PR updates the system test matrix to only test working versions with FK-joins. As we have not yet added system tests for upgrading from 3.4 and 3.5 (PR in progress: #13860), no test runs with FK-joins in this PR. PR #13860 will need to add those versions to close the gap.
This PR also disables the new state-updater thread, which still seems to be buggy and crashes system tests.
This PR should be cherry-picked to older branches, too.