KAFKA-15378: fix streams upgrade system test #14539

Merged
mjsax merged 4 commits into apache:trunk from kafka-15378-fix-streams-upgrade-test on Oct 20, 2023

mjsax
Member

@mjsax mjsax commented Oct 12, 2023

Fixing bad test setup. We tried to fix an upgrade bug for FK-joins in the 3.1 release, but it later turned out that the PR was not sufficient to fix it. We finally fixed it in the 3.4 release.

This PR updates the system test matrix to only test working versions with FK-joins. As we have not yet added system tests for upgrading from 3.4 and 3.5 (PR in progress: #13860), no test in this PR runs with FK-joins. PR #13860 will need to add those versions to close the gap.

This PR also disables the new state-updater thread, which still seems to be buggy and is crashing system tests.

This PR should be cherry-picked to older branches, too.
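
The version restriction described above can be sketched in plain Python. This is only an illustration under stated assumptions: `parse_version`, `fk_join_safe`, and the `candidate_versions` list are hypothetical and not taken from the Kafka system tests. The only fact used from this PR is that the FK-join upgrade bug was fixed in 3.4.

```python
# Hypothetical sketch: pick the "from" versions for which an upgrade test
# with FK-joins is expected to work (the fix landed in 3.4 per this PR).

def parse_version(version):
    """Turn a dotted version string such as '3.3.2' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def fk_join_safe(version, first_fixed=(3, 4)):
    """Return True if `version` is at or after the release with the fix."""
    return parse_version(version)[:2] >= first_fixed


# Illustrative list only; the real lists live in the system test itself.
candidate_versions = ["2.4.1", "2.8.2", "3.3.2", "3.4.1", "3.5.1"]

fk_join_versions = [v for v in candidate_versions if fk_join_safe(v)]
no_fk_join_versions = [v for v in candidate_versions if not fk_join_safe(v)]

print(fk_join_versions)      # ['3.4.1', '3.5.1']
print(no_fk_join_versions)   # ['2.4.1', '2.8.2', '3.3.2']
```

Comparing parsed tuples rather than raw strings avoids ordering mistakes such as "0.10.2.2" sorting after "0.9".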

@mjsax mjsax added the streams and tests (Test fixes, including flaky tests) labels Oct 12, 2023
Collaborator

@lihaosky lihaosky left a comment

LGTM!

@mjsax
Member Author

mjsax commented Oct 12, 2023

@mjsax
Member Author

mjsax commented Oct 13, 2023

Seems some system tests still failed... Let me look into it and see if I can reproduce locally... I did run a few locally already, and they passed... 🤔

Contributor

@guozhangwang guozhangwang left a comment

Took a look at #13860 too, I understand the motivation of this one now. LGTM!

@mjsax mjsax force-pushed the kafka-15378-fix-streams-upgrade-test branch from 9ca88e3 to 3d9ea7d Compare October 18, 2023 03:35
@mjsax
Member Author

mjsax commented Oct 18, 2023

Rebased this PR to pick up bug-fix #14555 (the bug was exposed via system test), and re-enabled the state-updater.

Also added a fix for streams_broker_down_resilience_test that was broken by a recent commit (fcac880) which changed an expected log message.

@mjsax
Member Author

mjsax commented Oct 18, 2023

Triggered a new system test build: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/5891/

Some tests seem to be flaky (cf. #13860 (comment)) -- let's see what the system test result is, and make a call to merge or add more fixes...

@@ -100,7 +100,7 @@ def test_streams_runs_with_broker_down_initially(self, metadata_quorum):
processor_3 = StreamsBrokerDownResilienceService(self.test_context, self.kafka, configs)
processor_3.start()

- broker_unavailable_message = "Broker may not be available"
+ broker_unavailable_message = "Node may not be available"
Member Author

Log message was changed via fcac880
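
For comparison, a check that tolerates both wordings could look like the sketch below. This is not what the PR does (the PR simply updates the expected string), and `UNAVAILABLE_PATTERN` and `mentions_unavailable_node` are hypothetical names. Only the two message texts are taken from the diff.

```python
import re

# Both wordings come from the diff above; the helper itself is hypothetical.
UNAVAILABLE_PATTERN = re.compile(r"(Broker|Node) may not be available")


def mentions_unavailable_node(log_line):
    """Return True if the line contains either wording of the warning."""
    return UNAVAILABLE_PATTERN.search(log_line) is not None
```

Such a pattern would keep one test body usable against clients from before and after the log message change, at the cost of a slightly looser assertion.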

# -> https://issues.apache.org/jira/browse/KAFKA-14646
# thus, we cannot test two bounce rolling upgrade because we know it's broken
# instead we add version 2.4...3.3 to the `metadata_2_versions` upgrade list
#fk_join_versions = [str(LATEST_3_4)]
Member Author

@mimaison You will need to uncomment this, add the 3.5 release to this list in your PR, and re-enable the corresponding @matrix annotation, too.

Member

Noted, I'll do that once this is merged. Thanks

@mjsax
Member Author

mjsax commented Oct 19, 2023

The following system tests failed:

  • test_upgrade_to_cooperative_rebalance
    • 0.10.1.1
    • 0.10.2.2
    • 1.0.2
    • 1.1.1
    • 2.0.1
    • 2.3.1
  • test_app_upgrade
    • 2.6.3 / full
    • 2.7.2 / full
    • 3.3.2 / full
  • test_rolling_upgrade_with_2_bounces
    • 0.10.0.1
    • 0.10.1.1
    • 0.10.2.2
    • 0.11.0.3
    • 1.0.2
    • 2.6.3
    • 2.7.2
    • 3.3.2
  • test_broker_type_bounce
    • "broker_type": "controller", "failure_mode": "hard_shutdown", "metadata_quorum": "ZK",
    • "broker_type": "leader", "failure_mode": "hard_shutdown", "metadata_quorum": "ISOLATED_KRAFT",
    • "broker_type": "leader", "failure_mode": "hard_shutdown", "metadata_quorum": "ZK",
  • test_many_brokers_bounce
    • "failure_mode": "clean_shutdown", "metadata_quorum": "ISOLATED_KRAFT",
    • "failure_mode": "clean_shutdown", "metadata_quorum": "ZK",
  • test_compatible_brokers_eos_alpha_enabled
    • 2.6.3
    • 2.7.2
    • 3.3.2
  • test_compatible_brokers_eos_disabled
    • 2.6.3
    • 2.7.2
    • 3.3.2
  • test_compatible_brokers_eos_v2_enabled
    • 2.6.3
    • 2.7.2
    • 3.3.2

Overall, we are not in good shape :(

Triggered a re-run to see what is noise: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/5896/

But I actually believe we might want to merge this PR as-is to unblock Mickael's PR, and tackle each of these tests one-by-one as follow-up work? Thoughts? @mimaison @guozhangwang

@mimaison
Member

I agree, it seems it may take a while to fix all these failures so let's merge these PRs.

@mjsax
Member Author

mjsax commented Oct 19, 2023

Test failures: far fewer than before. A few tests (i.e., "from versions") seem to be "stable" while others are not -- will keep digging.

  • test_upgrade_to_cooperative_rebalance

    • 0.11.0.3 (passed before)
    • 1.0.2 (failed again)
    • 1.1.1 (failed again)
    • 2.3.1 (failed again)
  • test_app_upgrade

    • 2.6.3 / full (failed again)
    • 2.7.2 / full (failed again)
    • 3.3.2 / full (failed again)
  • test_rolling_upgrade_with_2_bounces

    • 2.6.3 (failed again)
    • 2.7.2 (failed again)
    • 3.3.2 (failed again)
  • test_compatible_brokers_eos_alpha_enabled

    • 2.6.3 (failed again)
    • 2.7.2 (failed again)
    • 3.3.2 (failed again)
  • test_compatible_brokers_eos_disabled

    • 2.6.3 (failed again)
    • 2.7.2 (failed again)
    • 3.3.2 (failed again)

  • test_compatible_brokers_eos_v2_enabled

    • 2.6.3 (failed again)
    • 2.7.2 (failed again)
    • 3.3.2 (failed again)

Triggered a Jenkins re-run to get a clean build. Plan to merge afterwards.
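
The "(passed before)" and "(failed again)" annotations can be reproduced with simple set arithmetic. The sketch below uses the failing versions of test_upgrade_to_cooperative_rebalance from the two result lists in this thread. `compare_runs` is a hypothetical helper, not part of the test tooling.

```python
# Failing "from" versions of test_upgrade_to_cooperative_rebalance,
# copied from the two result lists in this thread.
first_run = {"0.10.1.1", "0.10.2.2", "1.0.2", "1.1.1", "2.0.1", "2.3.1"}
second_run = {"0.11.0.3", "1.0.2", "1.1.1", "2.3.1"}


def compare_runs(before, after):
    """Split failures into persistent, newly failing, and recovered sets."""
    return {
        "failed_again": sorted(before & after),
        "passed_before": sorted(after - before),
        "passed_now": sorted(before - after),
    }


result = compare_runs(first_run, second_run)
for label, versions in result.items():
    print(label, versions)
# failed_again ['1.0.2', '1.1.1', '2.3.1']
# passed_before ['0.11.0.3']
# passed_now ['0.10.1.1', '0.10.2.2', '2.0.1']
```

Versions that fail in both runs are the better candidates for a real bug, while versions that flip between runs point toward flakiness.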

@mimaison
Member

@mjsax Can you share the TC_PATHS, ducktape options and specs of the machines you used to run the system tests? I'm really having trouble getting any of them to pass regularly in my environment. Thanks

@mjsax
Member Author

mjsax commented Oct 20, 2023

I did not run all of them locally yet... the upgrade tests, and cooperative rebalancing ones only.

I am running them on my Mac, macOS Monterey (12.7), 2.3GHz 8-Core Intel i9 -- 32GB DDR4

I often modify the test Python code to run a single configuration only, and run a single test case (i.e., a single Python method). Otherwise I don't change anything.

I also delete the docker images regularly and let them re-build (especially when switching branches).
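
Selecting a single test method as described above can be sketched as follows, assuming ducktape's `file::Class.method` selector and the `TC_PATHS` variable read by `tests/docker/run_tests.sh`. The helper names `tc_path` and `build_invocation` are hypothetical, and the sketch only builds the invocation without running it.

```python
import os

# Script path as used for the dockerized system tests.
RUN_SCRIPT = "tests/docker/run_tests.sh"


def tc_path(test_file, test_class=None, test_method=None):
    """Build a ducktape test selector: file, file::Class, or file::Class.method."""
    if test_class is None:
        return test_file
    selector = test_class if test_method is None else f"{test_class}.{test_method}"
    return f"{test_file}::{selector}"


def build_invocation(path):
    """Return (env, argv) for launching one selected test; nothing is executed."""
    env = dict(os.environ, TC_PATHS=path)
    return env, ["bash", RUN_SCRIPT]


path = tc_path(
    "tests/kafkatest/tests/streams/streams_broker_compatibility_test.py",
    "StreamsBrokerCompatibility",
    "test_compatible_brokers_eos_v2_enabled",
)
env, argv = build_invocation(path)
print(f'TC_PATHS="{env["TC_PATHS"]}"', " ".join(argv))
```

The pair could be passed to `subprocess.run(argv, env=env)` from the repository root. Restricting a test to a single matrix configuration still requires editing the test file, as noted above.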

@mjsax mjsax merged commit 4371214 into apache:trunk Oct 20, 2023
1 check failed
@mjsax
Member Author

mjsax commented Oct 20, 2023

For example, I just re-ran:

$ TC_PATHS="tests/kafkatest/tests/streams/streams_broker_compatibility_test.py::StreamsBrokerCompatibility.test_compatible_brokers_eos_v2_enabled" bash tests/docker/run_tests.sh

[...]

================================================================================
SESSION REPORT (ALL TESTS)
ducktape version: 0.11.4
session_id:       2023-10-20--004
run time:         5 minutes 50.448 seconds
tests run:        8
passed:           7
flaky:            0
failed:           1
ignored:          0
================================================================================

[...]

Only the run for 2.6.3 failed. Looking into the test failure, the issue was that the broker did not start up in time, and the test ran into a timeout. The broker log shows:

Error: Exception thrown by the agent : java.rmi.server.ExportException: Port already in use: 9192; nested exception is: 
	java.net.BindException: Address already in use (Bind failed)

So I re-ran just this single configuration for 2.6.3, and it passed afterwards.
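
A quick way to see whether a leftover process still holds a port before re-running is to try binding to it. The sketch below is a generic check, not part of the Kafka test tooling, and `port_is_free` is a hypothetical helper. It assumes it runs on the same host (or inside the same container) as the broker.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "Address already in use", as in the broker log above.
            return False
    return True
```

For the error above, the call would be `port_is_free(9192)`. The result is only a snapshot: another process can still grab the port between the check and the broker start.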

@mjsax mjsax deleted the kafka-15378-fix-streams-upgrade-test branch October 20, 2023 23:33
mjsax added a commit that referenced this pull request Oct 20, 2023
Fixing bad test setup. We tried to fix an upgrade bug for FK-joins in 3.1 release, but it later turned out that the PR was not sufficient to fix it. We finally fixed in 3.4 release.

This PR updates the system test matrix to only test working versions with FK-joins, limited to available test versions.

Reviewers: Guozhang Wang <wangguoz@gmail.com>, Hao Li <hli@confluent.io>, Mickael Maison <mickael.maison@gmail.com>
@mjsax
Member Author

mjsax commented Oct 20, 2023

Merged to trunk and cherry-picked to 3.6, 3.5, and 3.4 branches.

@mjsax
Member Author

mjsax commented Oct 20, 2023

Just looked into the branch builder results for test_compatible_brokers_eos_v2_enabled 2.6.3 in more detail.

Broker log shows:

bash: /opt/kafka-2.6.3/bin/kafka-server-start.sh: No such file or directory

And the console output shows:

worker4: + get_kafka 2.6.2 2.12
17:45:33     worker4: + version=2.6.2
17:45:33     worker4: + scala_version=2.12
17:45:33     worker4: + kafka_dir=/opt/kafka-2.6.2
17:45:33     worker4: + url=https://s3-us-west-2.amazonaws.com/kafka-packages/kafka_2.12-2.6.2.tgz
17:45:33     worker4: + url_streams_test=https://s3-us-west-2.amazonaws.com/kafka-packages/kafka-streams-2.6.2-test.jar
17:45:33     worker4: + '[' '!' -d /opt/kafka-2.6.2 ']'
17:45:33     worker4: /tmp /opt/jdk/8
17:45:33     worker4: + pushd /tmp
17:45:33     worker4: + curl --retry 5 -O https://s3-us-west-2.amazonaws.com/kafka-packages/kafka_2.12-2.6.2.tgz

The Dockerfile does use 2.6.3 though -- not sure where 2.6.2 comes from? Could it be that this PR should have been rebased to pick up some Dockerfile updates I did recently (cdf726f)?

Maybe we should keep observing trunk runs and see what they do... Given that it's always 2.6.3, 2.7.2, and 3.3.2 that failed above, and those are exactly the versions the other PR bumped, I see a clear relationship.

And cooperative-rebalancing does not work with older versions, thus does not hit it, and seems to be flaky.
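
The mismatch between the provisioned and the requested version can be checked mechanically. The sketch below only parses the excerpts quoted in this comment. `installed_versions` and `requested_version` are hypothetical helpers, and the result says nothing about lines outside the quoted excerpt.

```python
import re

# Console lines copied from the output quoted above.
console_output = """\
worker4: + get_kafka 2.6.2 2.12
17:45:33     worker4: + version=2.6.2
17:45:33     worker4: + kafka_dir=/opt/kafka-2.6.2
"""

# Line copied from the broker log quoted above.
broker_error = "bash: /opt/kafka-2.6.3/bin/kafka-server-start.sh: No such file or directory"


def installed_versions(output):
    """Collect versions passed to get_kafka in the provisioning trace."""
    return set(re.findall(r"\+ get_kafka (\S+) \S+", output))


def requested_version(error_line):
    """Extract the version from a /opt/kafka-<version>/ path, or None."""
    match = re.search(r"/opt/kafka-([0-9.]+)/", error_line)
    return match.group(1) if match else None


installed = installed_versions(console_output)
wanted = requested_version(broker_error)
print(f"installed={sorted(installed)} wanted={wanted} missing={wanted not in installed}")
# installed=['2.6.2'] wanted=2.6.3 missing=True
```

Running the same comparison over a full console log would list every version whose install directory the tests expect but the worker image never provisioned.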

mjsax added a commit to confluentinc/kafka that referenced this pull request Nov 22, 2023
anurag-harness pushed a commit to anurag-harness/kafka that referenced this pull request Feb 9, 2024
anurag-harness added a commit to anurag-harness/kafka that referenced this pull request Feb 9, 2024
Co-authored-by: Matthias J. Sax <matthias@confluent.io>