You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We're seeing real cost + noise from our CI pipeline that's worth tightening before the repo goes public. This issue bundles all the CI-hygiene work and the post-#69 cleanup tasks into a single follow-up PR that, once merged, lets us re-add Lint/Test/Integration Tests to the ruleset's required status checks.
Scope updated 2026-04-23 to bundle the golangci-lint flake fix, dropping pull_request from the dual-trigger workflows, and SDK + E2E test coverage into one PR. Reason: all these changes are small workflow tweaks, related by concern, low-risk, and Taite isn't online for a few hours — so one larger PR is more efficient than splitting.
Scope (all bundled into one PR that closes this issue)
1. Fix TestDLQIntegration flakiness (root-cause, not retry-cover)
ClickHouse handshake / connection reset by peer (run 24805317816)
Both point at the same root cause: testcontainer readiness race for ClickHouse. Most likely causes:
Process ready vs. connection ready timing on startup
Port contention / cleanup race with prior tests
Missing t.Helper() or t.Cleanup() discipline
Acceptance:TestDLQIntegration runs 10× locally and 10× in CI without a single failure.
2. Gate claude-review.yml on CI pass/skip
Currently Claude review fires on pull_request: opened / synchronize regardless of CI state — tokens get spent on PRs the human will just bounce back as "come back when CI is green."
Implementation options:
(A) Switch trigger to workflow_run: { workflows: [CI], types: [completed] } + check conclusion in a first step. Clean semantics; trade-off is workflow_run has reduced default permissions (need to explicitly grant pull-requests: write).
(B) Keep pull_request trigger but add a first-step guard that queries gh api .../check-runs and exits 0 if any required check is failure / cancelled. Simpler permission model, slight latency cost (workflow spins up a runner just to decide to skip).
(C) Both: gate + sticky comment explaining "CI red; AI review skipped until green."
Acceptance: Claude review doesn't post on PRs whose required CI checks are failing. Post-fix re-push that goes green triggers review normally.
Out of scope: Gemini (managed App) and Copilot (per-seat).
golangci-lint-action runs golangci-lint config verify by default, which fetches a schema from golangci-lint.run. When that service is slow or the runner's network blips, Lint fails for reasons unrelated to our Go code.
Fix: pass verify: false to the golangci-lint-action step in ci.yml. Skips the schema-validate pre-flight; the actual linter run is unaffected. One-line change.
4. Drop pull_request trigger from pr-title.yml + label.yml
Those workflows were temporarily dual-triggered (both pull_request AND pull_request_target) in #69 as a transition. Now that pull_request_target is on main and confirmed to fire, drop the pull_request trigger. Stops running each check twice on internal PRs, removes the sticky-comment TOCTOU race Claude flagged.
Also: revert the concurrency-group ${{ github.event_name }} suffix once there's only one trigger — no longer needed.
5. Add SDK + E2E test jobs to CI
Gap surfaced by #63 (Dependabot SDK bump): make test-sdk and make test-e2e exist in the Makefile but aren't run by ci.yml. An SDK-only PR currently runs zero TypeScript tests automatically.
Proposed additions:
sdk-test job: runs on pull_request with paths: [clients/ts/**, tests/sdk/**]. Executes make test-sdk. Fast (~30s).
e2e job: runs on pull_request with paths covering the full ingest surface (internal/ingest/**, internal/api/**, clients/ts/**, tests/sdk/**). Executes make test-e2e. Slower (starts Docker containers), gate behind a label (e.g. run-e2e) or only on ready_for_review to manage runner cost.
Acceptance: SDK bump PRs exercise the SDK unit tests. PRs touching the ingest → CH → query path can optionally trigger E2E via label.
6. Smarter CI (lower priority — pick what fits)
Grab-bag of ideas:
Path-based CI skips via paths: / paths-ignore: on ci.yml. Docs-only PRs shouldn't run Go integration tests. Already partially possible; audit and expand.
Test parallelization via gotestsum --packages with explicit splits, or go test -parallel N. Could reduce Test job from ~4m → ~2m.
Cache warming: actions/cache for Go modules + build cache.
Change-aware test selection via go list -deps: only run Go tests for packages whose files (or transitive deps) changed. Probably over-engineering for our pace.
Once this PR merges — ruleset update
After verification on main + at least 3 clean CI runs, re-add these to the ruleset's required_status_checks:
Lint
Test
Integration Tests
Single gh api PUT /repos/Wave-RF/WaveHouse/rulesets/15353356 call.
Context
We're seeing real cost + noise from our CI pipeline that's worth tightening before the repo goes public. This issue bundles all the CI-hygiene work and the post-#69 cleanup tasks into a single follow-up PR that, once merged, lets us re-add
Lint/Test/Integration Teststo the ruleset's required status checks.Scope updated 2026-04-23 to bundle the golangci-lint flake fix, dropping
pull_requestfrom the dual-trigger workflows, and SDK + E2E test coverage into one PR. Reason: all these changes are small workflow tweaks, related by concern, low-risk, and Taite isn't online for a few hours — so one larger PR is more efficient than splitting.Scope (all bundled into one PR that closes this issue)
1. Fix
TestDLQIntegrationflakiness (root-cause, not retry-cover)Two observed failure modes:
"Received unexpected error"(run 24803345618)connection reset by peer(run 24805317816)Both point at the same root cause: testcontainer readiness race for ClickHouse. Most likely causes:
t.Helper()ort.Cleanup()disciplineAcceptance:
TestDLQIntegrationruns 10× locally and 10× in CI without a single failure.2. Gate
claude-review.ymlon CI pass/skipCurrently Claude review fires on
pull_request: opened/synchronizeregardless of CI state — tokens get spent on PRs the human will just bounce back as "come back when CI is green."Implementation options:
workflow_run: { workflows: [CI], types: [completed] }+ check conclusion in a first step. Clean semantics; trade-off isworkflow_runhas reduced default permissions (need to explicitly grant pull-requests: write).pull_requesttrigger but add a first-step guard that queriesgh api .../check-runsand exits 0 if any required check isfailure/cancelled. Simpler permission model, slight latency cost (workflow spins up a runner just to decide to skip).Acceptance: Claude review doesn't post on PRs whose required CI checks are failing. Post-fix re-push that goes green triggers review normally.
Out of scope: Gemini (managed App) and Copilot (per-seat).
3. Fix
golangci-lint config verifyexternal-schema flakeObserved multiple times on main and PRs:
golangci-lint-actionrunsgolangci-lint config verifyby default, which fetches a schema fromgolangci-lint.run. When that service is slow or the runner's network blips, Lint fails for reasons unrelated to our Go code.Fix: pass
verify: falseto thegolangci-lint-actionstep inci.yml. Skips the schema-validate pre-flight; the actual linter run is unaffected. One-line change.4. Drop
pull_requesttrigger frompr-title.yml+label.ymlThose workflows were temporarily dual-triggered (both
pull_requestANDpull_request_target) in #69 as a transition. Now thatpull_request_targetis on main and confirmed to fire, drop thepull_requesttrigger. Stops running each check twice on internal PRs, removes the sticky-comment TOCTOU race Claude flagged.Also: revert the concurrency-group
${{ github.event_name }}suffix once there's only one trigger — no longer needed.5. Add SDK + E2E test jobs to CI
Gap surfaced by #63 (Dependabot SDK bump):
make test-sdkandmake test-e2eexist in the Makefile but aren't run byci.yml. An SDK-only PR currently runs zero TypeScript tests automatically.Proposed additions:
sdk-testjob: runs onpull_requestwithpaths: [clients/ts/**, tests/sdk/**]. Executesmake test-sdk. Fast (~30s).e2ejob: runs onpull_requestwithpathscovering the full ingest surface (internal/ingest/**,internal/api/**,clients/ts/**,tests/sdk/**). Executesmake test-e2e. Slower (starts Docker containers), gate behind a label (e.g.run-e2e) or only onready_for_reviewto manage runner cost.Acceptance: SDK bump PRs exercise the SDK unit tests. PRs touching the ingest → CH → query path can optionally trigger E2E via label.
6. Smarter CI (lower priority — pick what fits)
Grab-bag of ideas:
paths:/paths-ignore:onci.yml. Docs-only PRs shouldn't run Go integration tests. Already partially possible; audit and expand.gotestsum --packageswith explicit splits, orgo test -parallel N. Could reduceTestjob from ~4m → ~2m.gotestsum --rerun-fails=2 --packages ./...intest-integration. Belt-and-suspenders after ci: bump golangci/golangci-lint-action from 7 to 9 #1 root-cause work; not a substitute.actions/cachefor Go modules + build cache.go list -deps: only run Go tests for packages whose files (or transitive deps) changed. Probably over-engineering for our pace.Once this PR merges — ruleset update
After verification on main + at least 3 clean CI runs, re-add these to the ruleset's
required_status_checks:LintTestIntegration TestsSingle
gh api PUT /repos/Wave-RF/WaveHouse/rulesets/15353356call.References
.github/workflows/ci.yml.github/workflows/claude-review.yml.github/workflows/pr-title.yml,.github/workflows/label.ymlOut of scope (tracked separately)
— Posted by Claude Code on behalf of @EricAndrechek