Skip to content

test(ci): fix flaky tests#13332

Merged
shreemaan-abhishek merged 6 commits into
apache:masterfrom
shreemaan-abhishek:fix/cli-flake-test-etcd-sync
May 7, 2026
Merged

test(ci): fix flaky tests#13332
shreemaan-abhishek merged 6 commits into
apache:masterfrom
shreemaan-abhishek:fix/cli-flake-test-etcd-sync

Conversation

@shreemaan-abhishek
Copy link
Copy Markdown
Contributor

Description

Follow-up to #13266. After that PR landed I re-ran the cross-branch CI flake analysis on the past week of failures (apache/apisix master + several feature branches) and four flakes survived. This PR fixes them.

Test Symptom Fix
t/cli/test_etcd_sync_event_handle.sh failed: Round 2 Request 1 unexpected (curl /1 returned 503/stale route instead of 204) Replace sleep 5; curl /1; ... with a 30s deadline poll on /1 until the new fault-injection plugin is applied
t/core/config_etcd.t TEST 10 timeout when waiting for the process N to exit at Test/Nginx/Util.pm line 683 Bump TEST_NGINX_TIMEOUT to 30 for the file (test leaves a run_watch background timer; default 3s kill-wait is too short on slow runners)
t/admin/plugins-reload.t TEST 1 + TEST 2 grep_error_log_out empty for sync local conf to etcd / reload plugins on node before reload ngx.sleep(1)ngx.sleep(2) (post-reload sync log occasionally lands after the 1s grace)
t/discovery/eureka.t TEST 4 error_log pattern failed to fetch registry from 127.0.0.1:20997: status=502 not matched Add --- wait: 2 so the 1s eureka fetch_interval has time to fire before the grep

Why these flakes weren't caught in #13266

The previous bundle was scoped to test-files that flaked at high frequency on master before the fix; these four either flaked in branches not yet sampled, or showed up only after the prior fixes landed and freed up other slow paths. Each one has been seen on master/main + at least one feature branch within the past 7 days.

Which issue(s) this PR fixes:

Fixes #

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible

Fix four flakes that survived apache#13266, identified by re-running the
cross-branch flake analysis on CI failures from the past week:

- t/cli/test_etcd_sync_event_handle.sh "Round 2 Request 1 unexpected":
  fixed `sleep 5` was too short on slow runners — etcd auth-toggle +
  watch-reconnect + bulk-event-apply can take longer than 5s, so curl
  hits the stale route. Replaced with a 30s deadline poll on /1 until
  status 204 (the new fault-injection plugin) appears.
- t/core/config_etcd.t TEST 10 "timeout when waiting for the process to
  exit": the test calls test_sync_data which fires init_watch_ctx, leaving
  a run_watch background timer alive; on slow runners nginx shutdown
  exceeds the default 3s kill-wait. Bumped TEST_NGINX_TIMEOUT to 30 for
  the file.
- t/admin/plugins-reload.t TEST 1, TEST 2 "grep_error_log_out empty":
  the post-reload "sync local conf to etcd" log doesn't always land
  within ngx.sleep(1). Bumped to ngx.sleep(2).
- t/discovery/eureka.t TEST 4 "failed to fetch registry from
  127.0.0.1:20997 should match": added `--- wait: 2` so the eureka
  fetch interval (1s) has a chance to fire before the grep runs.
@dosubot dosubot Bot added size:S This PR changes 10-29 lines, ignoring generated files. CI labels May 6, 2026
@shreemaan-abhishek shreemaan-abhishek changed the title test(ci): fix flaky tests (follow-up to #13266) test(ci): fix flaky tests May 6, 2026
The previous bump was on the wrong sleep. The flake symptom is the
"reload plugins on node before reload" log line missing from the
grep_error_log_out diff — i.e. the initial filter fire (which logs
"before reload" while before_reload=true) didn't land before the test
flipped before_reload=false. The fix is to bump the sleep BETWEEN
core.config.new and the before_reload=false flip, not the final sleep
after the reload PUT.
The bumped ngx.sleep calls in TEST 1/2 push handler time over the
default 3s socket timeout from Test::Nginx::Util::$Timeout, causing
prove to fail with "ERROR: client socket timed out". Add `--- timeout:
10` to both tests so the test client waits long enough for the
extended sleep windows to complete.
membphis
membphis previously approved these changes May 6, 2026
AlinsRan
AlinsRan previously approved these changes May 6, 2026
nic-6443
nic-6443 previously approved these changes May 6, 2026
Same fix shape as t/core/config_etcd.t in this PR — tests that exercise
etcd-watcher background timers can hit Test::Nginx's default 3s
process-exit kill-wait on slow runners. The EE counterpart of this file
has tripped this in CI; the same race exists in apisix and is a free
pre-emptive fix (no test slowdown — env var only affects shutdown).
@shreemaan-abhishek shreemaan-abhishek dismissed stale reviews from nic-6443, AlinsRan, and membphis via 41cfc25 May 6, 2026 05:58
The test runs with workers(4) and the patched startBackendTimer only
logs "start skywalking backend timer" when ngx.worker.id()==0. With 12
keepalive=false connections and 4 workers, the ~3% probability that no
connection lands on worker 0 produced a recurring flake — observed on
master + 3 unrelated branches in the past 8 days. Bumping to 50 drops
the miss probability to (3/4)^50 ≈ 0.0006%. No semantic change.
The previous commit bumped the request loop from 12 to 50, but 50
sequential keepalive=false round-trips on CI runners can take longer
than the default 3s test-client socket timeout, causing
"ERROR: client socket timed out". Add `--- timeout: 10` so the test
client waits long enough for the extended handler to complete.
@dosubot dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels May 6, 2026
@shreemaan-abhishek shreemaan-abhishek merged commit e32ed3e into apache:master May 7, 2026
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI size:M This PR changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants