Skip to content

ci(rate-limit): wait for etherpad readiness before running test#7726

Merged
JohnMcLear merged 2 commits into
developfrom
fix/rate-limit-wait-etherpad
May 11, 2026
Merged

ci(rate-limit): wait for etherpad readiness before running test#7726
JohnMcLear merged 2 commits into
developfrom
fix/rate-limit-wait-etherpad

Conversation

@JohnMcLear
Copy link
Copy Markdown
Member

@JohnMcLear JohnMcLear commented May 11, 2026

Summary

  • Adds a readiness poll on `http://127.0.0.1:8081/\` between `run docker images` and the test step in `.github/workflows/rate-limit.yml`.
  • Eliminates the race where the etherpad container started with `&` isn't yet listening when the test fires, causing nginx to return 502 and the workflow to fail.

Why

Developments three most recent runs of this workflow — all on README-only commits (#7723, #7724, #7725) — failed with the same 502 from nginx upstream. PR #7721 also tripped it three reruns in a row. With a warm pnpm-store, `pnpm install --frozen-lockfile` finishes faster than etherpad can boot inside the container, so the test starts before the upstream is responsive. Polling until nginx stops returning 5xx makes the run deterministic.

Test plan

  • CI's `rate limit` job passes on this PR (the failure mode it fixes).
  • Polling step output shows "etherpad is ready behind nginx" within ~30s on a normal run.
  • If etherpad fails to come up, the new step prints `docker logs etherpad-docker` so future failures are easier to triage.

🤖 Generated with Claude Code

The workflow starts the etherpad container in the background, then runs
`pnpm install`, then runs the test. On a warm pnpm-store the install can
finish before etherpad is listening on 9001, at which point nginx returns
502 for the test request and the run fails. Recent README-only commits
on develop hit this race on three consecutive runs.

Poll the nginx-proxied endpoint (port 8081 — also what the test uses)
until it stops returning 5xx, with a 2-minute timeout and `docker logs
etherpad-docker` on giving up to make diagnosis straightforward.
@qodo-code-review
Copy link
Copy Markdown

ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.

@qodo-free-for-open-source-projects
Copy link
Copy Markdown

Review Summary by Qodo

Wait for etherpad readiness before running rate-limit test

🐞 Bug fix

Grey Divider

Walkthroughs

Description
• Adds readiness polling for etherpad container before running tests
• Prevents race condition where nginx returns 502 errors
• Polls nginx-proxied endpoint until responsive with 2-minute timeout
• Outputs docker logs on failure for easier troubleshooting
Diagram
flowchart LR
  A["Start etherpad container"] --> B["Run pnpm install"]
  B --> C["Poll nginx endpoint"]
  C --> D{Ready?}
  D -->|Yes| E["Run rate-limit test"]
  D -->|No| F["Output docker logs"]
  F --> G["Fail workflow"]
Loading

Grey Divider

File Changes

1. .github/workflows/rate-limit.yml 🐞 Bug fix +19/-0

Add etherpad readiness polling step to workflow

• Adds new workflow step to poll etherpad readiness via nginx endpoint
• Implements 60-iteration loop with 2-second intervals and 2-second curl timeout
• Outputs diagnostic docker logs if etherpad fails to become ready
• Prevents test execution until upstream is responsive

.github/workflows/rate-limit.yml


Grey Divider

Qodo Logo

@qodo-free-for-open-source-projects
Copy link
Copy Markdown

qodo-free-for-open-source-projects Bot commented May 11, 2026

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (1) 📎 Requirement gaps (0)

Grey Divider


Remediation recommended

1. No regression test for race 📘 Rule violation ☼ Reliability
Description
This PR fixes a CI race condition in .github/workflows/rate-limit.yml but does not add a
regression test/guard that would deterministically fail if the readiness wait were removed or
regressed. Without a regression test, the same 502/flakiness can be reintroduced without being
reliably caught during review.
Code

.github/workflows/rate-limit.yml[R66-84]

+      -
+        # The etherpad container is started in the background above; without
+        # this wait the test step can hit nginx before etherpad is listening
+        # and nginx returns 502, failing the run on a cold cache. Poll the
+        # nginx-proxied endpoint (which is also what the test hits) until it
+        # stops returning 5xx.
+        name: Wait for etherpad behind nginx to be ready
+        run: |
+          for i in $(seq 1 60); do
+            if curl -fsS -o /dev/null --max-time 2 http://127.0.0.1:8081/; then
+              echo "etherpad is ready behind nginx"
+              exit 0
+            fi
+            sleep 2
+          done
+          echo "ERROR: etherpad behind nginx did not become ready in time"
+          echo "--- etherpad-docker logs ---"
+          docker logs etherpad-docker || true
+          exit 1
Evidence
Compliance ID 1 requires a regression test for each bug fix; the diff shows only a workflow change
that implements the fix (readiness polling) and no accompanying regression test/guard beyond the
workflow itself to ensure the fix cannot be removed/regressed without being caught.

.github/workflows/rate-limit.yml[66-84]
Best Practice: Repository guidelines

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The PR fixes a CI race by adding a readiness poll, but it does not add a regression test/guard that would reliably fail if the fix were reverted.
## Issue Context
Because this is a workflow-level bug fix, reintroducing the race may only show up as intermittent 502 failures rather than a deterministic test failure.
## Fix Focus Areas
- .github/workflows/rate-limit.yml[66-84]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. Missing nginx logs on timeout ✓ Resolved 🐞 Bug ◔ Observability
Description
On timeout the step only prints etherpad-docker logs, but the polled endpoint is the nginx
container on 8081 and that container is started without a name. If nginx is the problem, the
workflow won’t capture its logs, slowing triage.
Code

.github/workflows/rate-limit.yml[R81-83]

+          echo "ERROR: etherpad behind nginx did not become ready in time"
+          echo "--- etherpad-docker logs ---"
+          docker logs etherpad-docker || true
Evidence
The workflow starts nginx detached on port 8081 without a name, while the new timeout handler only
prints etherpad logs; nginx is also the reverse-proxy entrypoint for all requests in this test
setup.

.github/workflows/rate-limit.yml[58-84]
src/tests/ratelimit/nginx.conf[6-20]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The timeout handler dumps only etherpad logs, but failures at `http://127.0.0.1:8081/` can be caused by nginx itself (crash, bad config, network). Because nginx is started without `--name`, it’s awkward to reliably capture its logs during failure.
## Issue Context
Nginx is the container bound to host port 8081 and proxies to etherpad.
## Fix Focus Areas
- .github/workflows/rate-limit.yml[58-84]
## Suggested fix
1) Start nginx with a stable name, e.g. `docker run --name nginx-docker -p 8081:80 ... -d nginx-latest`.
2) On readiness timeout, print both:
- `docker logs nginx-docker || true`
- `docker logs etherpad-docker || true`
Optionally add `docker ps -a` for quick container state visibility.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Advisory comments

3. Wait can reach 4m ✓ Resolved 🐞 Bug ➹ Performance
Description
The readiness poll can delay job failure up to ~240s because each of the 60 iterations includes a
curl with --max-time 2 plus sleep 2. This unnecessarily prolongs CI runs when etherpad never
becomes ready.
Code

.github/workflows/rate-limit.yml[R74-80]

+          for i in $(seq 1 60); do
+            if curl -fsS -o /dev/null --max-time 2 http://127.0.0.1:8081/; then
+              echo "etherpad is ready behind nginx"
+              exit 0
+            fi
+            sleep 2
+          done
Evidence
The loop runs 60 times, sleeps 2 seconds each iteration, and allows each curl attempt to take up to
2 seconds, so the worst-case time is additive across both delays.

.github/workflows/rate-limit.yml[66-84]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The readiness loop’s wall-clock upper bound is ~4 minutes (60 * (2s curl timeout + 2s sleep)), which can slow down failure feedback.
## Issue Context
This is the new “Wait for etherpad behind nginx to be ready” step.
## Fix Focus Areas
- .github/workflows/rate-limit.yml[66-84]
## Suggested fix
Use a single wall-clock timeout (e.g., `timeout 120s bash -c 'until curl ...; do sleep 2; done'`) or reduce either the per-try curl timeout or the sleep interval so the maximum wait matches your intended bound (for example, 120s total). Also consider printing the final HTTP status/connection error to make timeouts easier to interpret.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

- Name the nginx container so its logs can be captured when the readiness
  poll times out — previously nginx was started anonymously and a
  502 caused by nginx itself (rather than etherpad) would have been
  hard to diagnose from the workflow log alone.
- On timeout also dump `docker ps -a` for container-state visibility.
- Tighten the readiness wait: 30 iterations × (1s curl timeout + 1s
  sleep) gives ~60s budget instead of ~240s, which is still well above
  observed cold-start time and keeps the failure-fast contract.
@JohnMcLear
Copy link
Copy Markdown
Member Author

Addressed Qodo's review in c38e5ea:

  • Missing nginx logs on timeout — nginx container is now named (`--name nginx-docker`) and its logs are dumped on readiness timeout, along with `docker ps -a` for container-state visibility.
  • Wait can reach 4m — tightened to 30 iterations × (1s curl timeout + 1s sleep) ≈ 60s budget. Still well above observed cold-start time, but fails-fast when something is genuinely wrong.
  • No regression test — pushing back on this one. The failure mode is "test runs before container is ready," which is inherently a CI-orchestration race — there is no realistic deterministic regression test for it short of running this very workflow. The workflow itself, which was failing on three consecutive develop commits prior to this PR, is the regression signal. Adding a synthetic guard that races a sleep against the same poll would just be testing the test.

@JohnMcLear JohnMcLear merged commit 3113c50 into develop May 11, 2026
43 checks passed
@JohnMcLear JohnMcLear deleted the fix/rate-limit-wait-etherpad branch May 11, 2026 15:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant