
[tests] Fix Kafka E2E test timeout and retry race condition #2358

Merged
lexfrei merged 1 commit into main from fix/kafka-e2e-timeout on Apr 8, 2026
Conversation

Contributor

@lexfrei lexfrei commented Apr 8, 2026

What this PR does

Fixes the Kafka E2E test (hack/e2e-apps/kafka.bats), which fails consistently in CI due to two issues:

  1. Insufficient timeout — Kafka CR readiness timeout was 60s, but Strimzi with 4 JVM pods (2 Kafka + 2 ZooKeeper) on QEMU-based CI sandbox needs 3-5 minutes to fully start. Increased to 300s, consistent with other heavyweight tests (MongoDB 300s, FoundationDB 300s, Harbor 600s).

  2. Retry race condition — On retry attempts, kubectl apply could hit a still-deleting resource from the previous attempt, resulting in "unchanged" on a deleting object and no new HelmRelease being created. Added wait --for=delete before re-applying, following the existing pattern in harbor.bats and run-kubernetes.sh.
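The delete-then-reapply pattern described above can be sketched as a small bash helper. This is an illustrative sketch, not the actual kafka.bats contents: the function name, namespace, and manifest path are assumptions, and the wait-for-delete timeout of 120s is a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the retry-safe apply pattern from this PR.
# KUBECTL is parameterized so the helper can be exercised without a cluster.
KUBECTL="${KUBECTL:-kubectl}"

apply_with_cleanup_wait() {
  local ns="$1" name="$2" manifest="$3"

  # Remove any leftover object from a previous retry attempt.
  $KUBECTL -n "$ns" delete kafkas "$name" --ignore-not-found

  # Block until deletion fully completes, so the subsequent apply creates
  # a fresh object instead of "unchanged" on a still-terminating one.
  $KUBECTL -n "$ns" wait kafkas "$name" --for=delete --timeout=120s || true

  # Re-apply and wait for readiness with the increased 300s timeout.
  $KUBECTL -n "$ns" apply -f "$manifest"
  $KUBECTL -n "$ns" wait kafkas "$name" --for=condition=ready --timeout=300s
}
```

The `|| true` after the delete wait keeps the helper tolerant of the object already being gone; harbor.bats and run-kubernetes.sh follow the same wait-before-reapply shape per the PR description.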

CI evidence: https://github.com/cozystack/cozystack/actions/runs/24135612137/job/70424376181

Release note

[tests] Fix Kafka E2E timeout (60s→300s) and retry race condition

Summary by CodeRabbit

  • Tests
    • Enhanced Kafka end-to-end test reliability by ensuring complete resource cleanup before reapplying configurations.
    • Increased Kafka readiness verification timeout to accommodate longer initialization periods during testing.

Increase Kafka CR readiness timeout from 60s to 300s to account for
slow Strimzi startup on QEMU-based CI sandbox (4 JVM pods).

Add wait-for-delete before re-applying to prevent race condition where
kubectl apply hits a still-deleting resource on retry attempts.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
@dosubot added labels on Apr 8, 2026: size/XS (This PR changes 0-9 lines, ignoring generated files), kind/bug (Categorizes issue or PR as related to a bug)

coderabbitai Bot commented Apr 8, 2026

📝 Walkthrough

The e2e test for Kafka now adds explicit resource deletion verification using kubectl wait before reapplying the manifest, and increases the readiness timeout from 60 seconds to 300 seconds to allow more time for the Kafka resource to become ready.

Changes

Test Configuration Update (hack/e2e-apps/kafka.bats):
Added kubectl wait --for=delete verification after deleting Kafka resources to ensure complete cleanup before reapplication. Increased readiness condition timeout from 60 to 300 seconds for improved test stability.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 A patient test now waits with care,
For Kafka's traces in the air,
Three hundred seconds, generous and long,
Ensure the pods sing readiness song! 🎵

🚥 Pre-merge checks: ✅ 3 passed
✅ Description Check: Passed. Check skipped because CodeRabbit's high-level summary is enabled.
✅ Title Check: Passed. The title accurately and specifically describes the two main fixes in the changeset: increased timeout and retry race condition resolution in the Kafka E2E test.
✅ Docstring Coverage: Passed. No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.



@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the Kafka end-to-end test script to ensure proper resource deletion by adding a wait command and increases the timeout for the Kafka readiness check. The review suggests replacing the hardcoded resource name with the $name variable and using the full resource type name for better consistency and maintainability.

Comment thread: hack/e2e-apps/kafka.bats

  sleep 5
  kubectl -n tenant-test wait hr kafka-$name --timeout=30s --for=condition=ready
- kubectl wait kafkas -n tenant-test test --timeout=60s --for=condition=ready
+ kubectl wait kafkas -n tenant-test test --timeout=300s --for=condition=ready
Severity: medium

The resource name is hardcoded as test here, which is inconsistent with the rest of the script that uses the $name variable. Additionally, using the full resource name kafka.apps.cozystack.io and consistent flag ordering (placing -n before the command) improves maintainability and aligns with the patterns used elsewhere in this file.

  kubectl -n tenant-test wait kafka.apps.cozystack.io $name --timeout=300s --for=condition=ready

@lexfrei lexfrei self-assigned this Apr 8, 2026

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
hack/e2e-apps/kafka.bats (1)

44-44: Timeout increase is appropriate; consider using $name variable for consistency.

The 300s timeout aligns well with other heavy tests (MongoDB, FoundationDB). However, the resource name is hardcoded as test instead of using the $name variable, which is inconsistent with the rest of the test (e.g., line 5 uses $name).

♻️ Suggested fix for consistency
-  kubectl wait kafkas -n tenant-test test --timeout=300s --for=condition=ready
+  kubectl wait kafkas -n tenant-test $name --timeout=300s --for=condition=ready
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hack/e2e-apps/kafka.bats` at line 44, Replace the hardcoded resource name
"test" in the kubectl wait invocation with the existing variable $name to match
the rest of the script; locate the kubectl wait call (kubectl wait kafkas -n
tenant-test test --timeout=300s --for=condition=ready) and change the resource
argument to use $name so it becomes consistent with other references to $name in
the test.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1d778766-7c90-44ab-93e5-8868e5e373f8

📥 Commits

Reviewing files that changed from the base of the PR and between 38624b7 and e16908b.

📒 Files selected for processing (1)
  • hack/e2e-apps/kafka.bats

@dosubot added the lgtm label (This PR has been approved by a maintainer) on Apr 8, 2026
@lexfrei lexfrei merged commit d27b01c into main Apr 8, 2026
13 checks passed
@lexfrei lexfrei deleted the fix/kafka-e2e-timeout branch April 8, 2026 18:36

Labels

kind/bug: Categorizes issue or PR as related to a bug
lgtm: This PR has been approved by a maintainer
size/XS: This PR changes 0-9 lines, ignoring generated files

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants