test(ci): fix flaky tests #13332
Merged
shreemaan-abhishek merged 6 commits on May 7, 2026
Fix four flakes that survived apache#13266, identified by re-running the cross-branch flake analysis on CI failures from the past week:

- t/cli/test_etcd_sync_event_handle.sh "Round 2 Request 1 unexpected": fixed `sleep 5` was too short on slow runners — etcd auth-toggle + watch-reconnect + bulk-event-apply can take longer than 5s, so curl hits the stale route. Replaced with a 30s deadline poll on /1 until status 204 (the new fault-injection plugin) appears.
- t/core/config_etcd.t TEST 10 "timeout when waiting for the process to exit": the test calls test_sync_data which fires init_watch_ctx, leaving a run_watch background timer alive; on slow runners nginx shutdown exceeds the default 3s kill-wait. Bumped TEST_NGINX_TIMEOUT to 30 for the file.
- t/admin/plugins-reload.t TEST 1, TEST 2 "grep_error_log_out empty": the post-reload "sync local conf to etcd" log doesn't always land within ngx.sleep(1). Bumped to ngx.sleep(2).
- t/discovery/eureka.t TEST 4 "failed to fetch registry from 127.0.0.1:20997 should match": added `--- wait: 2` so the eureka fetch interval (1s) has a chance to fire before the grep runs.
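The deadline poll that replaces the fixed `sleep 5` could be sketched roughly as follows. This is not the code from the PR: `wait_for_status` and `probe_route` are hypothetical helper names, and the `http://127.0.0.1:9080/1` address is an assumption about where the test instance listens. The idea is simply to retry the request until the expected status appears or the deadline passes, instead of sleeping for a fixed interval and hoping the route has been updated.

```shell
# Sketch of a deadline poll. All names here are hypothetical.

# wait_for_status EXPECTED DEADLINE_SECS INTERVAL_SECS PROBE_CMD [ARGS...]
# PROBE_CMD must print an HTTP status code on stdout.
# Returns 0 as soon as the probe prints EXPECTED, 1 once the deadline passes.
wait_for_status() {
    expected="$1"
    deadline_secs="$2"
    interval="$3"
    shift 3
    deadline=$(( $(date +%s) + $deadline_secs ))
    while :; do
        # A failing probe (e.g. connection refused) counts as "not ready yet".
        status="$("$@" 2>/dev/null || true)"
        if [ "$status" = "$expected" ]; then
            return 0
        fi
        if [ "$(date +%s)" -ge "$deadline" ]; then
            echo "timed out after ${deadline_secs}s, last status: ${status:-none}" >&2
            return 1
        fi
        sleep "$interval"
    done
}

# Probe for the route under test; the host and port are assumptions.
probe_route() {
    curl -s -o /dev/null -w '%{http_code}' "http://127.0.0.1:9080/1"
}

# Possible use in the test script (not executed here):
#   wait_for_status 204 30 1 probe_route || {
#       echo "failed: Round 2 Request 1 unexpected"
#       exit 1
#   }
```

On a fast runner the poll returns after the first successful probe, so the longer deadline only costs time when the runner is actually slow.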
The previous bump was on the wrong sleep. The flake symptom is the "reload plugins on node before reload" log line missing from the grep_error_log_out diff — i.e. the initial filter fire (which logs "before reload" while before_reload=true) didn't land before the test flipped before_reload=false. The fix is to bump the sleep BETWEEN core.config.new and the before_reload=false flip, not the final sleep after the reload PUT.
The bumped ngx.sleep calls in TEST 1/2 push handler time over the default 3s socket timeout from Test::Nginx::Util::$Timeout, causing prove to fail with "ERROR: client socket timed out". Add `--- timeout: 10` to both tests so the test client waits long enough for the extended sleep windows to complete.
membphis
previously approved these changes
May 6, 2026
AlinsRan
previously approved these changes
May 6, 2026
nic-6443
previously approved these changes
May 6, 2026
Same fix shape as t/core/config_etcd.t in this PR — tests that exercise etcd-watcher background timers can hit Test::Nginx's default 3s process-exit kill-wait on slow runners. The EE counterpart of this file has tripped this in CI; the same race exists in apisix and is a free pre-emptive fix (no test slowdown — env var only affects shutdown).
41cfc25
The test runs with workers(4) and the patched startBackendTimer only logs "start skywalking backend timer" when ngx.worker.id()==0. With 12 keepalive=false connections and 4 workers, the probability that no connection lands on worker 0 is (3/4)^12 ≈ 3.2%, which produced a recurring flake observed on master + 3 unrelated branches in the past 8 days. Bumping to 50 drops the miss probability to (3/4)^50 ≈ 0.00006%. No semantic change.
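The miss probability can be checked with a quick calculation. The sketch below assumes each connection lands on one of the workers independently and uniformly at random, which is a simplification of how the kernel actually distributes accepted connections; `miss_probability` is a hypothetical helper name.

```shell
# miss_probability WORKERS CONNECTIONS
# Prints, as a percentage, the chance that none of CONNECTIONS connections
# lands on one particular worker, assuming each connection picks a worker
# independently and uniformly at random (a simplifying assumption).
miss_probability() {
    awk -v w="$1" -v n="$2" 'BEGIN { printf "%.5f\n", 100 * ((w - 1) / w) ^ n }'
}

miss_probability 4 12   # 12 requests across 4 workers
miss_probability 4 50   # 50 requests across 4 workers
```

Under this model the first call prints roughly 3.17 (percent) and the second roughly 0.00006 (percent).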
The previous commit bumped the request loop from 12 to 50, but 50 sequential keepalive=false round-trips on CI runners can take longer than the default 3s test-client socket timeout, causing "ERROR: client socket timed out". Add `--- timeout: 10` so the test client waits long enough for the extended handler to complete.
AlinsRan
approved these changes
May 6, 2026
nic-6443
approved these changes
May 6, 2026
Baoyuantop
approved these changes
May 7, 2026
Description
Follow-up to #13266. After that PR landed I re-ran the cross-branch CI flake analysis on the past week of failures (apache/apisix master + several feature branches) and four flakes survived. This PR fixes them.
| Test | Symptom | Fix |
|---|---|---|
| `t/cli/test_etcd_sync_event_handle.sh` | `failed: Round 2 Request 1 unexpected` (curl `/1` returned 503/stale route instead of 204) | Replace `sleep 5; curl /1; ...` with a 30s deadline poll on `/1` until the new `fault-injection` plugin is applied |
| `t/core/config_etcd.t` TEST 10 | `timeout when waiting for the process N to exit at Test/Nginx/Util.pm line 683` | Bump `TEST_NGINX_TIMEOUT` to 30 for the file (test leaves a `run_watch` background timer; default 3s kill-wait is too short on slow runners) |
| `t/admin/plugins-reload.t` TEST 1 + TEST 2 | `grep_error_log_out` empty for `sync local conf to etcd` / `reload plugins on node before reload` | `ngx.sleep(1)` → `ngx.sleep(2)` (post-reload sync log occasionally lands after the 1s grace) |
| `t/discovery/eureka.t` TEST 4 | `error_log` pattern `failed to fetch registry from 127.0.0.1:20997: status=502` not matched | Add `--- wait: 2` so the 1s eureka `fetch_interval` has time to fire before the grep |

Why these flakes weren't caught in #13266
The previous bundle was scoped to test-files that flaked at high frequency on master before the fix; these four either flaked in branches not yet sampled, or showed up only after the prior fixes landed and freed up other slow paths. Each one has been seen on master/main + at least one feature branch within the past 7 days.
Which issue(s) this PR fixes:
Fixes #
Checklist