
[tests] Fix Kafka E2E test timeout and retry race condition #2358

Merged
lexfrei merged 1 commit into main from fix/kafka-e2e-timeout on Apr 8, 2026
Conversation

Contributor

@lexfrei lexfrei commented Apr 8, 2026

What this PR does

Fixes the Kafka E2E test (hack/e2e-apps/kafka.bats), which fails consistently in CI due to two issues:

  1. Insufficient timeout — Kafka CR readiness timeout was 60s, but Strimzi with 4 JVM pods (2 Kafka + 2 ZooKeeper) on QEMU-based CI sandbox needs 3-5 minutes to fully start. Increased to 300s, consistent with other heavyweight tests (MongoDB 300s, FoundationDB 300s, Harbor 600s).

  2. Retry race condition — On retry attempts, kubectl apply could hit a still-deleting resource from the previous attempt, resulting in "unchanged" on a deleting object and no new HelmRelease being created. Added wait --for=delete before re-applying, following the existing pattern in harbor.bats and run-kubernetes.sh.
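The delete-then-reapply pattern described above can be sketched as a small bash helper. This is an illustrative sketch, not the actual kafka.bats contents: the function name, namespace, and manifest path are assumptions, and the wait-for-delete timeout of 120s is a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the retry-safe apply pattern from this PR.
# KUBECTL is parameterized so the helper can be exercised without a cluster.
KUBECTL="${KUBECTL:-kubectl}"

apply_with_cleanup_wait() {
  local ns="$1" name="$2" manifest="$3"

  # Remove any leftover object from a previous retry attempt.
  $KUBECTL -n "$ns" delete kafkas "$name" --ignore-not-found

  # Block until deletion fully completes, so the subsequent apply creates
  # a fresh object instead of "unchanged" on a still-terminating one.
  $KUBECTL -n "$ns" wait kafkas "$name" --for=delete --timeout=120s || true

  # Re-apply and wait for readiness with the increased 300s timeout.
  $KUBECTL -n "$ns" apply -f "$manifest"
  $KUBECTL -n "$ns" wait kafkas "$name" --for=condition=ready --timeout=300s
}
```

The `|| true` after the delete wait keeps the helper tolerant of the object already being gone; harbor.bats and run-kubernetes.sh follow the same wait-before-reapply shape per the PR description.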

CI evidence: https://github.com/cozystack/cozystack/actions/runs/24135612137/job/70424376181

Release note

[tests] Fix Kafka E2E timeout (60s→300s) and retry race condition

Summary by CodeRabbit

  • Tests
    • Enhanced Kafka end-to-end test reliability by ensuring complete resource cleanup before reapplying configurations.
    • Increased Kafka readiness verification timeout to accommodate longer initialization periods during testing.

Increase Kafka CR readiness timeout from 60s to 300s to account for
slow Strimzi startup on QEMU-based CI sandbox (4 JVM pods).

Add wait-for-delete before re-applying to prevent race condition where
kubectl apply hits a still-deleting resource on retry attempts.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
@dosubot added labels on Apr 8, 2026: size/XS (This PR changes 0-9 lines, ignoring generated files), kind/bug (Categorizes issue or PR as related to a bug)

coderabbitai Bot commented Apr 8, 2026

📝 Walkthrough

The e2e test for Kafka now adds explicit resource deletion verification using kubectl wait before reapplying the manifest, and increases the readiness timeout from 60 seconds to 300 seconds to allow more time for the Kafka resource to become ready.

Changes

Test Configuration Update (hack/e2e-apps/kafka.bats):
Added kubectl wait --for=delete verification after deleting Kafka resources to ensure complete cleanup before reapplication. Increased readiness condition timeout from 60 to 300 seconds for improved test stability.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 A patient test now waits with care,
For Kafka's traces in the air,
Three hundred seconds, generous and long,
Ensure the pods sing readiness song! 🎵

🚥 Pre-merge checks: ✅ 3 passed
✅ Description Check: Passed. Check skipped because CodeRabbit's high-level summary is enabled.
✅ Title Check: Passed. The title accurately and specifically describes the two main fixes in the changeset: increased timeout and retry race condition resolution in the Kafka E2E test.
✅ Docstring Coverage: Passed. No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.



@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the Kafka end-to-end test script to ensure proper resource deletion by adding a wait command and increases the timeout for the Kafka readiness check. The review suggests replacing the hardcoded resource name with the $name variable and using the full resource type name for better consistency and maintainability.

Comment thread: hack/e2e-apps/kafka.bats

  sleep 5
  kubectl -n tenant-test wait hr kafka-$name --timeout=30s --for=condition=ready
- kubectl wait kafkas -n tenant-test test --timeout=60s --for=condition=ready
+ kubectl wait kafkas -n tenant-test test --timeout=300s --for=condition=ready
Severity: medium

The resource name is hardcoded as test here, which is inconsistent with the rest of the script that uses the $name variable. Additionally, using the full resource name kafka.apps.cozystack.io and consistent flag ordering (placing -n before the command) improves maintainability and aligns with the patterns used elsewhere in this file.

  kubectl -n tenant-test wait kafka.apps.cozystack.io $name --timeout=300s --for=condition=ready

@lexfrei lexfrei self-assigned this Apr 8, 2026

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
hack/e2e-apps/kafka.bats (1)

44-44: Timeout increase is appropriate; consider using $name variable for consistency.

The 300s timeout aligns well with other heavy tests (MongoDB, FoundationDB). However, the resource name is hardcoded as test instead of using the $name variable, which is inconsistent with the rest of the test (e.g., line 5 uses $name).

♻️ Suggested fix for consistency
-  kubectl wait kafkas -n tenant-test test --timeout=300s --for=condition=ready
+  kubectl wait kafkas -n tenant-test $name --timeout=300s --for=condition=ready
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hack/e2e-apps/kafka.bats` at line 44, Replace the hardcoded resource name
"test" in the kubectl wait invocation with the existing variable $name to match
the rest of the script; locate the kubectl wait call (kubectl wait kafkas -n
tenant-test test --timeout=300s --for=condition=ready) and change the resource
argument to use $name so it becomes consistent with other references to $name in
the test.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1d778766-7c90-44ab-93e5-8868e5e373f8

📥 Commits

Reviewing files that changed from the base of the PR and between 38624b7 and e16908b.

📒 Files selected for processing (1)
  • hack/e2e-apps/kafka.bats

@dosubot added the lgtm label (This PR has been approved by a maintainer) on Apr 8, 2026
@lexfrei lexfrei merged commit d27b01c into main Apr 8, 2026
13 checks passed
@lexfrei lexfrei deleted the fix/kafka-e2e-timeout branch April 8, 2026 18:36

Labels

kind/bug: Categorizes issue or PR as related to a bug
lgtm: This PR has been approved by a maintainer
size/XS: This PR changes 0-9 lines, ignoring generated files

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants